Abstract
Comparison of 317 gene pairs in human and mouse that were duplicated after the most recent common ancestor of the two species was used to search for candidates that may have undergone functional differentiation. Even when corrected for multiple tests, Tajima’s relative rate test showed significant rate differences in 36% of cases for which the test was applicable. However, a significant result in this case was increasingly likely as the sequence length increased; thus, a statistically significant result of a relative rate test may not be biologically meaningful. We used regression methods, which take into account the variation in the entire data set, to provide more robust tests for functionally differentiated gene pairs. Examination of residuals from the regressions identified gene pairs with unusually high nonsynonymous divergence from a reference sequence and from each other. This approach identified six duplicate gene pairs that appeared to be candidates for functional differentiation as a result of positive Darwinian selection.
Introduction
Gene duplication, which can give rise to new genes encoding proteins with new functions, is believed to have played an important role in the evolutionary diversification of organisms (Nei 1969; Ohno 1970; Li 1982; Hughes 1994). There is evidence that gene duplication occurs continually over evolutionary time (Lynch and Conery 2000; Friedman and Hughes 2003). Only certain duplicate genes are actually retained in the genome, while the others are eventually lost. If a duplicate gene can assume a new function beneficial to the organism, it is more likely that it will be retained (Hughes 1994; Lynch et al. 2001). Thus, positive Darwinian selection may frequently be involved in the fixation of duplicate genes that have undergone beneficial mutations (Hughes 1999a).
It has often been difficult to obtain evidence that positive selection has acted on mutations leading to the functional differentiation of duplicate genes. A useful approach to testing for positive selection involves comparing the number of synonymous nucleotide substitutions per synonymous site (d S ) with the number of nonsynonymous nucleotide substitutions per nonsynonymous site (d N ) (Hughes and Nei 1988). In a number of cases, this approach has provided evidence of positive selection diversifying duplicate genes at the amino acid level (e.g., Hill and Hastie 1987; Tanaka and Nei 1989; Hughes 1999b, 2002; Hughes et al. 2000).
However, it seems unlikely that this approach will be able to detect positive selection in many cases involving multigene families. First, since positive selection is likely to be focused only on certain functional regions of the protein, this approach works best in cases in which structural and functional information is available (Hughes 1999a). Moreover, positive selection favoring specialization of duplicate genes may typically occur over a short time frame. Once the proteins encoded by two duplicate genes have become specialized for distinct functions, new amino acid changes may no longer be favored (Hughes 1999a). If so, purifying selection will again predominate, and eventually d S will overtake d N . There is evidence of such an evolutionary process in plots of d N vs. d S in pairwise comparisons among members of a variety of gene families. In such plots, d N often exceeds d S when d S is low, and thus the genes compared have a recent common ancestor, but d S exceeds d N in more distant comparisons (e.g., Tanaka and Nei 1989; Hughes et al. 2000).
An additional correlate of functional divergence between two duplicated genes might be inequality (asymmetry) of the rates of nonsynonymous substitution in the two genes; such a pattern might indicate that one of the two genes has adopted a new function, whereas the other gene has retained a function closer to that of the ancestral gene. Some recent studies have made use of genomic data to survey for nonsynonymous rate asymmetry between duplicate genes. Kondrashov et al. (2002), in a study of 101 paralogs, found that significant nonsynonymous rate asymmetry occurred in only 5 cases. On the other hand, in analyses of Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Drosophila melanogaster, Conant and Wagner (2003) found significant nonsynonymous rate asymmetry in 22 of 80 duplicate gene pairs. Furthermore, Zhang et al. (2003) found significant rate asymmetry at the amino acid level in 145 of 250 human duplicate gene pairs. Although all of these authors used different methods to test for rate asymmetry, none applied any correction for multiple tests. Thus, the true significance levels in these studies remain unclear.
Here we study pairs of paralogous genes in the human and mouse genomes that have arisen by gene duplication since the most recent common ancestor of the two species (estimated to have occurred about 110 million years ago; Kumar and Hedges 1998). We employ robust approaches to test for rate asymmetry between paralogs, with an emphasis on methods that control for multiple testing. We apply simple regression-based methods that take into account the variability in the entire data set, with the goal of identifying gene pairs likely to have diversified functionally as a result of positive selection.
Methods
Sequence Data
The genomic data for human (version 16.33) and mouse (version 16.30) were obtained from Ensembl (http://www.ensembl.org). The version numbers refer to the database software (Ensembl version 16) and the assembly of the genomic sequences (NCBI versions 33 and 30). Genes predicted by Ensembl have been curated and verified by similarity with homologs discovered experimentally (Clamp et al. 2003). The numbers of annotated protein-coding genes were 32,035 for human and 32,911 for mouse. After removal of genes that were shorter than 100 bases or longer than 300,000 bases, the total gene sets were 29,606 in human and 32,296 in mouse. Further curation to remove overlapping loci resulted in a count of 20,387 genes in human and 23,222 genes in mouse.
Protein families were identified by homology and a single-linkage method employed by the BLASTCLUST software available in the Blast software package (Altschul et al. 1997). Sequence homology was established by identifying matches using a conservative E-value of 10⁻⁶ with a minimum of 30% sequence identity across at least 50% of the length of two sequences. The single-linkage method assembles larger families by linking shared genes among families, thus ensuring that a given gene will be assigned to only one family. To identify recent duplicates, we chose families with exactly three members and at least one member from each of the two species. From these families, we selected those cases in which the number of synonymous substitutions per synonymous site (d S ) between the two conspecific genes was lower than that for either between-species comparison.
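The selection rule for recent duplicates can be illustrated with a minimal sketch. The data layout (a dictionary mapping gene identifiers to species, and a dictionary of pairwise d S values) and the function name are assumptions made for illustration; they are not taken from the original analysis pipeline.

```python
from typing import Dict, FrozenSet


def is_recent_duplicate_family(
    species: Dict[str, str],
    ds: Dict[FrozenSet[str], float],
) -> bool:
    """Apply the three-member, two-species, lowest-within-species-dS rule.

    species: hypothetical mapping of gene ID -> "human" or "mouse".
    ds: hypothetical mapping of an unordered gene pair -> estimated d S.
    """
    if len(species) != 3:
        return False
    if set(species.values()) != {"human", "mouse"}:
        return False

    # Group genes by species: one species contributes the two putative
    # duplicates, the other contributes the single reference gene.
    by_species: Dict[str, list] = {}
    for gene, sp in species.items():
        by_species.setdefault(sp, []).append(gene)
    duplicates = next(g for g in by_species.values() if len(g) == 2)
    reference = next(g for g in by_species.values() if len(g) == 1)[0]

    # The within-species d S must be lower than both between-species values.
    ds_within = ds[frozenset(duplicates)]
    return all(ds_within < ds[frozenset((d, reference))] for d in duplicates)
```

A family that passes this check is one in which the two conspecific genes are more similar to each other at synonymous sites than either is to the gene from the other species, as expected if the duplication followed the human-mouse divergence.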
There is evidence that even recently duplicated genes can be chimeras as a result of exon shuffling (Katju and Lynch 2003). In order to rule out chimeric genes with marked differences between regions with respect to the extent of sequence divergence, we computed the proportion of amino acid difference in a window of 30 aligned amino acid residues along each pair of paralogs. One pair of paralogs showed a strong difference in sequence similarity between N-terminal and C-terminal regions, and this pair was found to correspond to a known chimeric gene (Paulding et al. 2003). Therefore, this gene family was excluded from the analysis. The resulting data set contained 316 families in which a gene duplication occurred in human (119 families) or in mouse (197 families) after the last common ancestor of human and mouse. The data are available from http://www.biol.sc.edu/~austin/.
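The windowed screen for chimeric genes can be sketched as follows. The text specifies only the window size of 30 aligned residues; the one-residue step between windows and the exclusion of gapped columns are assumptions of this sketch, and the function name is hypothetical.

```python
from typing import List


def windowed_p_distance(seq_a: str, seq_b: str, window: int = 30) -> List[float]:
    """Proportion of amino acid differences in each window of aligned residues.

    seq_a and seq_b are two rows of the same alignment (equal length,
    "-" for gaps). Columns with a gap in either sequence are skipped
    (an assumption), and windows advance one residue at a time (also an
    assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    # True where the two aligned residues differ, ignoring gapped columns.
    differs = [a != b for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]

    return [
        sum(differs[start:start + window]) / window
        for start in range(len(differs) - window + 1)
    ]
```

A chimeric pair would be expected to show a run of windows with low divergence followed by a run with markedly higher divergence (or the reverse), rather than values scattered around a single level.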
Statistical Analyses
Homologous sequences were aligned at the amino acid level using the CLUSTAL W program (Thompson et al. 1994), and this alignment was imposed on the DNA sequences. The number of synonymous nucleotide substitutions per synonymous site (d S ) and the number of nonsynonymous nucleotide substitutions per nonsynonymous site (d N ) were estimated by a maximum likelihood method (Yang and Nielsen 2000) using the software package PAML (Yang 1997). We used Tajima’s (1993) method to test the hypothesis that duplicate gene pairs evolved at equal rates at the amino acid level, i.e., to test for rate asymmetry at the amino acid level. This test has an advantage over some other methods that have been used for such relative-rate tests because it is not model-dependent (Tajima 1993). However, the test statistic could not be computed in 32 of 317 families, either because the amino acid sequences were too similar or because they were too divergent. We also used approaches to identifying rate asymmetry based on linear regression, which have the advantage of taking into account stochastic error in the entire data set.
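For readers unfamiliar with the relative rate test, the following minimal sketch shows the simplest form of Tajima's statistic: sites at which only one of the two duplicates differs from the other two sequences are counted, and the two counts are compared by a chi-square test with one degree of freedom. The treatment of gapped columns and the function name are assumptions of this sketch, which is not the software used in the original analysis.

```python
import math
from typing import Optional, Tuple


def tajima_relative_rate(
    seq1: str, seq2: str, reference: str
) -> Optional[Tuple[int, int, float, float]]:
    """Return (m1, m2, chi-square, p-value), or None if no informative sites.

    m1 counts sites where seq1 alone differs (seq2 matches the reference);
    m2 counts sites where seq2 alone differs (seq1 matches the reference).
    Columns containing a gap are skipped (an assumption of this sketch).
    """
    m1 = m2 = 0
    for a, b, c in zip(seq1, seq2, reference):
        if "-" in (a, b, c):
            continue
        if a != b and b == c:
            m1 += 1
        elif a != b and a == c:
            m2 += 1

    if m1 + m2 == 0:
        return None  # statistic undefined, e.g., sequences too similar

    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper-tail probability of a chi-square variate with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p_value
```

Because the statistic depends only on the counts m1 and m2, longer alignments, which tend to yield larger counts, make a given proportional imbalance easier to detect.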
Results
Relative Rate Tests
Tajima’s (1993) relative rate test statistic could be computed for 284 duplicate gene pairs. Using a Bonferroni-corrected simultaneous significance level, 102 (35.9%) pairs showed a significant rate asymmetry in amino acid sequence evolution at the 5% level. The proportions of gene pairs showing rate asymmetry were similar for the two species: 41 of 109 (37.6%) in human and 60 of 175 (34.3%) in mouse. At the 1% level (Bonferroni-corrected), 81 of 284 (28.5%) gene pairs showed significant rate asymmetry, 35 of 109 (32.1%) in human and 46 of 175 (26.3%) in mouse. Thus despite the use of a conservative statistical test, our results showed a high frequency of rate asymmetry, comparable to other studies (Conant and Wagner 2003; Zhang et al. 2003).
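The effect of the Bonferroni correction on the significance threshold can be made concrete with a short sketch. The function names are hypothetical, and the example assumes that the number of tests equals the 284 gene pairs for which the statistic could be computed.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def chi2_1df_pvalue(chi2: float) -> float:
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Assumed number of tests: the 284 pairs with a computable statistic.
threshold_5pct = bonferroni_threshold(0.05, 284)
threshold_1pct = bonferroni_threshold(0.01, 284)


def is_significant(chi2: float, threshold: float = threshold_5pct) -> bool:
    """True if the test remains significant after correction."""
    return chi2_1df_pvalue(chi2) < threshold
```

Under this correction, a chi-square value that would be significant at the nominal 5% level (about 3.84) falls far short of the corrected threshold, which is why the high proportion of significant tests is notable.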
In order to understand the factors contributing to a significant relative rate test, we compared duplicate gene pairs with statistically significant (at the 5% level) evidence of asymmetry in the rate of amino acid evolution with those showing no evidence of asymmetry. We found that gene pairs with significant asymmetry encoded significantly longer polypeptides on average (Table 1). The mean length of pairs with significant asymmetry was 354.5 residues (range, 102–1134), while the mean length of pairs without significant asymmetry was 253.9 residues (range, 52–962). A plausible explanation for the difference in mean length between the two groups is that the power of Tajima’s (1993) test to detect a rate difference increases as the number of sites increases. Pairs with significant asymmetry also had greater mean values of both the number of synonymous substitutions per synonymous site (d S ) and the number of nonsynonymous substitutions per nonsynonymous site (d N ) (Table 1). These results also are best explained as reflecting the power of the test to detect rate differences, which is likely to increase as the number of differences between the sequences increases.
Thus, these results suggested that evidence of a significant rate asymmetry was largely a function of the statistical power to detect a rate difference. As a consequence, applying such a test to individual gene pairs may not be the optimal approach if the goal is to identify cases of exceptional rate asymmetry, which may be indicative of functional divergence between duplicated genes. Instead, we used a number of approaches to identify duplicate gene pairs whose divergence at the amino acid level was unusually high in comparison to other gene pairs in the data set.
Regression Methods
The approach we chose involved conducting linear regression analyses and identifying outliers from the linear trend (as evidenced by high standardized residuals). First, we conducted a linear regression of d N between one duplicate and the reference sequence (the other species) against d N between the other duplicate and the reference sequence. We refer to these values as d N1 and d N2 (Fig. 1). Because the order in which the two duplicates were compared to the reference was arbitrary, we used the absolute value of the standardized residual from this regression as an indicator of cases where the absolute difference between d N1 and d N2 was unusually large and thus there was asymmetry in the nonsynonymous rate. The mean absolute value of the standard residuals from the regression of d N1 vs. d N2 was significantly higher in cases where Tajima’s test showed significant rate asymmetry than in cases where Tajima’s test showed no rate asymmetry (Table 1). Thus, gene pairs showing high absolute values of the standard residuals from the regression of d N1 vs. d N2 seemed good candidates for functional divergence between duplicates. There were 14 cases (9 in human, 5 in mouse) with standard residuals ≥2.0 in absolute value (Table 2).
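The residual-based screen can be sketched as below. The sketch assumes that a standardized residual is the raw residual from an ordinary least-squares fit divided by the residual standard deviation (with n − 2 degrees of freedom); statistical packages differ on this definition, and the one used in the original analysis is not stated. Function names are hypothetical.

```python
import math
from typing import List, Sequence


def standardized_residuals(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Residuals of the least-squares line y = a + b*x, divided by their SD.

    Assumption: standardization uses sqrt(SSE / (n - 2)), without any
    leverage adjustment.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return [r / sd for r in residuals]


def asymmetry_outliers(
    dn1: Sequence[float], dn2: Sequence[float], cutoff: float = 2.0
) -> List[int]:
    """Indices of gene pairs whose |standardized residual| is >= cutoff."""
    z = standardized_residuals(dn2, dn1)  # regress d N1 on d N2
    return [i for i, zi in enumerate(z) if abs(zi) >= cutoff]
```

Taking the absolute value makes the screen insensitive to which duplicate was arbitrarily labeled 1 or 2, in line with the reasoning given above.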
In addition, we conducted regression of d N1/d S1 vs. d N2/d S2 and examined the absolute values of the standard residuals. In this analysis also, the mean absolute value of the standard residual was higher in cases where the Tajima test showed significant rate asymmetry than in cases where that test did not show significant rate asymmetry (Table 1).
In order to determine which factors were most associated with significant evidence of rate asymmetry in Tajima’s (1993) test, we computed partial correlations between each of the variables whose means are summarized in Table 1 and the chi-square statistic for Tajima’s test (Table 3). Only two variables showed significant partial correlations with the chi-square statistic when controlling simultaneously for all other variables: the number of codons in the sequence and the absolute value of the standard residual from the regression of d N1 vs. d N2 (Table 3). The significant partial correlation in the case of the latter variable showed that this variable is associated with a significant result of Tajima’s (1993) test independent of the increase in power of that test as a function of increased sequence length and increased sequence divergence.
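A partial correlation measures the association between two variables after the linear effects of the remaining variables have been removed. The following minimal sketch computes it by the standard recursion on Pearson correlations; the dictionary-of-columns layout and the function names are assumptions for illustration, and the sketch does not reproduce the significance testing reported in Table 3.

```python
import math
from typing import Dict, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((v - mean_a) ** 2 for v in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return sab / math.sqrt(saa * sbb)


def partial_corr(
    data: Dict[str, Sequence[float]], x: str, y: str, controls: Sequence[str]
) -> float:
    """Correlation of data[x] and data[y], controlling for all `controls`.

    Uses the recursion that removes one control variable at a time.
    """
    if not controls:
        return pearson(data[x], data[y])
    z, rest = controls[0], list(controls[1:])
    r_xy = partial_corr(data, x, y, rest)
    r_xz = partial_corr(data, x, z, rest)
    r_yz = partial_corr(data, y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

The recursion is exponential in the number of controls, which is acceptable for the handful of variables considered here; for many variables, inverting the correlation matrix would be the more usual route.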
Synonymous and Nonsynonymous Substitutions
Figure 2 illustrates the number of synonymous substitutions per synonymous site (d S ) and the number of nonsynonymous substitutions per nonsynonymous site (d N ) in comparisons between duplicate pairs. Overall, mean d S (0.139 ± 0.011 SE) was significantly greater than mean d N (0.073 ± 0.006) (paired t-test; p < 0.001). However, in 36 (11.4%) of 316 gene pairs, d N exceeded d S , while in 14 gene pairs (4.4%) d N and d S were equal. Only in two cases did d N exceed d S significantly (at the 5% level) by the widely used z-test. In both of these cases no synonymous substitutions were observed. The two gene pairs involved were the following: (1) a gene pair encoding proteins of unknown function (Ensembl I.D. ENSP00000295653 and ENSP00000303012) with d N = 0.0077 ± 0.0031 and (2) two genes encoding proteins related to pro-melanin-concentrating hormone (Ensembl I.D. ENSP00000323682 and ENSP00000295326) with d N = 0.0306 ± 0.0126. Because of the lack of synonymous substitutions, both of these cases evidently represented very recent duplicates. Also, it is worth pointing out that if the z-test results were corrected for multiple tests, the z-test would no longer be significant in either of these cases.
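The z-test and the effect of correcting it for multiple tests can be illustrated with the two reported cases. This sketch assumes a two-sided test, a standard error of zero for d S where no synonymous substitutions were observed, and a Bonferroni correction over 316 gene pairs; none of these details is stated explicitly in the text, and the function name is hypothetical.

```python
import math
from typing import Tuple


def z_test_dn_vs_ds(
    dn: float, se_dn: float, ds: float, se_ds: float
) -> Tuple[float, float]:
    """Return (z, two-sided p) for the difference d N - d S."""
    z = (dn - ds) / math.sqrt(se_dn ** 2 + se_ds ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Reported d N values and standard errors; d S = 0 with SE assumed to be 0.
case_1 = z_test_dn_vs_ds(0.0077, 0.0031, 0.0, 0.0)
case_2 = z_test_dn_vs_ds(0.0306, 0.0126, 0.0, 0.0)

# Assumed Bonferroni threshold over the 316 gene pairs in the data set.
corrected_threshold = 0.05 / 316
```

Under these assumptions both cases pass the nominal 5% level but fall well short of the corrected threshold, consistent with the caveat above.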
In order to identify other cases with unusually high d N with respect to d S , we searched for large positive standard residuals from the regression of d N vs. d S (Fig. 2). There were 10 gene pairs (5 from human, 5 from mouse) in which the standard residual was ≥2.0 (Table 4). In two of these pairs, d N was greater than d S although not significantly so by the z-test (Table 4). Furthermore, six of these gene pairs also showed unusually high absolute values of the standard residual from the regression of d N1 vs. d N2 (Tables 2 and 4). These six gene pairs thus show both unusually high d N relative to d S , as indicated by high standard residuals (Table 4), and strong nonsynonymous rate asymmetry. As a consequence, these six gene pairs seem to be good initial candidates for duplicates that have diverged functionally.
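Combining the two screens amounts to intersecting two sets of outliers: pairs with a large positive standard residual from the regression of d N vs. d S , and pairs with a large absolute standard residual from the regression of d N1 vs. d N2 . A minimal sketch, assuming both residual lists are already computed and indexed by gene pair in the same order (the function name is hypothetical):

```python
from typing import List, Sequence


def joint_candidates(
    resid_dn_vs_ds: Sequence[float],
    resid_dn1_vs_dn2: Sequence[float],
    cutoff: float = 2.0,
) -> List[int]:
    """Indices of gene pairs flagged by both residual-based criteria.

    The d N vs. d S residual must be positive and >= cutoff (unusually high
    d N for a given d S); the d N1 vs. d N2 residual is taken in absolute
    value because the labeling of the two duplicates is arbitrary.
    """
    return [
        i
        for i, (r_sel, r_asym) in enumerate(zip(resid_dn_vs_ds, resid_dn1_vs_dn2))
        if r_sel >= cutoff and abs(r_asym) >= cutoff
    ]
```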
Discussion
Relative rate tests (Wu and Li 1985; Tajima 1993; Takezaki et al. 1995) have been widely used to test the hypothesis of equality of the rate of molecular evolution between sequences (or groups of sequences) by comparison to an outgroup or reference. Such tests typically rely on the assumption that all sites evolve independently. As a consequence, it is expected that, as the number of sites examined increases, the power of the test to detect rate asymmetry will increase, so that even small rate differences become statistically significant. However, such small rate differences may actually result from stochastic error and thus may not be biologically meaningful. Consistent with the theoretical prediction that the power of these tests increases as the number of sites examined increases, we found that the number of sites examined was a good predictor of statistical significance in Tajima’s (1993) test, even when a very conservative correction for multiple testing was applied (Tables 1 and 3).
In studies whose goal is to identify cases where duplicate gene pairs may have diverged as a result of adaptation to distinct functions, it seems preferable to use approaches that take into account the variance across gene pairs. In the present paper, we analyzed data on recently duplicated gene pairs in mammals using robust approaches based on linear regression. We show that these approaches can be used to search for gene pairs whose divergence at the amino acid level is unusually high in comparison to others in the data set.
One of these approaches is based on the absolute values of the standard residuals from the regression of d N in the comparison of one duplicate gene with the reference sequence against d N in the comparison of the other duplicate gene with the reference sequence. Partial correlation analysis showed that the absolute value of the standard residuals from this regression was significantly correlated with the chi-square statistic in Tajima’s (1993) test, independent of the effect of sequence length (Table 3). An additional approach was based on the standard residuals from the regression of d N vs. d S between the two duplicated genes. Interestingly, six gene pairs were identified by both of these methods (Tables 2 and 4). These gene pairs seem the best candidates in our data set for functionally diverged duplicate gene pairs.
Computation of d S and d N over the entire coding region of a gene can rarely provide a meaningful test of the hypothesis of positive Darwinian selection, because such selection typically acts only on a limited region involved in the function that is under selection (Hughes 1999a). In the present data set, only two gene pairs showed a significantly greater value of d N than d S for the entire gene by the commonly used z-test. However, in both of these cases, no synonymous substitutions were observed, and d N was quite low. These cases may represent positive selection that occurred soon after gene duplication. On the other hand, the difference between d N and d S may be due to stochastic error. Additional information on the structure of the proteins encoded by these genes will be needed to definitively rule out the latter possibility.
Of the 10 cases in which the standard residuals from the regression of d N against d S were unusually large, d N exceeded d S in only 2, and in neither of these cases was the difference significant by the z-test (Table 4). On the other hand, these 10 cases were identified by a method that takes into account the variance in d S and d N over the entire data set. Such cases may actually be at least as plausible candidates for positive selection as the two cases in which d N exceeded d S significantly by the z-test. In searching for cases of adaptive evolution at the molecular level, it may be preferable to employ a two-step procedure: (1) using regression of d N vs. d S , identify cases with an unusually high d N for a given d S ; (2) using structural information, identify functionally important regions of these molecules and compute d N and d S separately in each region. In the present data set, this approach identified six duplicate gene pairs as good candidates for functional divergence. Detailed structural information was lacking for these six duplicate pairs, but further application of this approach may uncover candidates with known structure and may eventually inspire structural studies on genes whose duplication and functional divergence may have played an important role in the evolution of biological processes.
References
SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, DJ Lipman (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
M Clamp, D Andrews, D Barker, P Bevan, G Cameron, Y Chen, L Clark, T Cox, J Cuff, V Curwen, T Down, R Durbin, E Eyras, J Gilbert, M Hammond, T Hubbard, A Kasprzyk, D Keefe, H Lehvaslaiho, V Iyer, C Melsopp, E Mongin, R Pettett, S Potter, A Rust, E Schmidt, S Searle, G Slater, J Smith, W Spooner, A Stabenau, J Stalker, E Stupka, A Ureta-Vidal, I Vastrik, E Birney (2003) Ensembl 2002: Accommodating comparative genomics. Nucleic Acids Res 31:38–42. doi:10.1093/nar/gkg083
GC Conant, A Wagner (2003) Asymmetric sequence divergence of duplicate genes. Genome Res 13:2052–2058. doi:10.1101/gr.1252603
R Friedman, AL Hughes (2003) The temporal distribution of gene duplication events in a set of highly conserved human gene families. Mol Biol Evol 20:154–161. doi:10.1093/molbev/msg017
RE Hill, ND Hastie (1987) Accelerated evolution in the reactive center regions of serine protease inhibitors. Nature 326:96–99. doi:10.1038/326096a0
AL Hughes (1994) The evolution of functionally novel proteins after gene duplication. Proc R Soc Lond B 256:119–124
AL Hughes (1999a) Adaptive evolution of genes and genomes. Oxford University Press, New York
AL Hughes (1999b) Evolutionary diversification of the mammalian defensins. Cell Mol Life Sci 56:94–103. doi:10.1007/s000180050010
AL Hughes (2002) Evolution of the human killer cell inhibitory receptor family. Mol Phylogenet Evol 25:330–340. doi:10.1016/S1055-7903(02)00255-5
AL Hughes, M Nei (1988) Pattern of nucleotide substitution at MHC class I genes reveals overdominant selection. Nature 335:167–170. doi:10.1038/335167a0
AL Hughes, JA Green, JM Garbayo, RM Roberts (2000) Adaptive diversification within a large family of recently duplicated, placentally expressed genes. Proc Natl Acad Sci USA 97:3319–3327. doi:10.1073/pnas.050002797
V Katju, M Lynch (2003) The structure and early evolution of recently arisen gene duplicates in the Caenorhabditis elegans genome. Genetics 165:1793–1803
FA Kondrashov, IB Rogozin, YI Wolf, EV Koonin (2002) Selection in the evolution of gene duplications. Genome Biol 3. doi:10.1186/gb-2002-3-2-research0008
S Kumar, SB Hedges (1998) A molecular timescale for vertebrate evolution. Nature 392:917–920. doi:10.1038/31927
W-H Li (1982) Evolutionary change of duplicate genes. Isozymes 6:55–92
M Lynch, JS Conery (2000) The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155. doi:10.1126/science.290.5494.1151
M Lynch, M O’Hely, B Walsh, A Force (2001) The probability of preservation of a newly arisen gene duplicate. Genetics 159:1789–1804
M Nei (1969) Gene duplication and nucleotide substitution in evolution. Nature 221:40–42
S Ohno (1970) Evolution by gene duplication. Springer-Verlag, Berlin
CA Paulding, M Ruvolo, DA Haber (2003) The Tre2 (USP6) oncogene is a hominid-specific gene. Proc Natl Acad Sci USA 100:2507–2511. doi:10.1073/pnas.0437015100
F Tajima (1993) Simple methods for testing the molecular evolutionary clock hypothesis. Genetics 135:559–607
N Takezaki, A Rzhetsky, M Nei (1995) Phylogenetic test of the molecular clock and linearized trees. Mol Biol Evol 12:823–833
T Tanaka, M Nei (1989) Positive selection observed at the variable-region genes of immunoglobulin. Mol Biol Evol 6:447–459
JD Thompson, DG Higgins, T Gibson (1994) CLUSTALW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680
C-I Wu, W-H Li (1985) Evidence for higher rates of nucleotide substitution in rodents than in man. Proc Natl Acad Sci USA 82:1741–1745
Z Yang (1997) PAML: A program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–556
Z Yang, R Nielsen (2000) Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 17:32–43
P Zhang, Z Gu, W-H Li (2003) Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol 4:R56. doi:10.1186/gb-2003-4-9-r56
Acknowledgments
This research was supported by NIH Grant GM66710. We are grateful for comments on the manuscript by Vaishali Katju and two anonymous reviewers.
Cite this article
Hughes, A.L., Friedman, R. Recent Mammalian Gene Duplications: Robust Search for Functionally Divergent Gene Pairs. J Mol Evol 59, 114–120 (2004). https://doi.org/10.1007/s00239-004-2616-9
DOI: https://doi.org/10.1007/s00239-004-2616-9