Introduction

Gene duplications are one of the most important mechanisms for the origin of evolutionary novelties (Ohno 1970). Duplicated genes are observed from deep to recent levels of evolutionary divergence, indicating that gene duplications occurred throughout evolutionary history. There are different types of gene duplications: whole genome, segmental, and tandem duplications. Whole genome duplications occurred in yeast, vertebrates, ciliates, and plants (Aury et al. 2006; Scannell et al. 2007; Tang et al. 2008; Kuraku et al. 2009). Tandem gene duplications have been observed in nearly every species whose genome has been sequenced so far. After duplication, possible evolutionary fates of paralogous genes are nonfunctionalization or neofunctionalization of one duplicate, or subfunctionalization of both copies (Force et al. 1999; Lynch and Force 2000). Complete redundancy over long evolutionary time periods seems unlikely because mutational pressure will ultimately lead to the nonfunctionalization of one of the two genes (Clark 1994; Lynch et al. 2001; O’Hely 2006).

Plant genomes contain high frequencies of duplicated genes (AG Initiative 2000; IRGS Project 2005; Tuskan et al. 2006; Jaillon et al. 2007). One of the most important mechanisms is polyploidization, which has affected more than 50% of all plant species (Blanc and Wolfe 2004b; Soltis and Soltis 2009). In the lineage leading to Arabidopsis thaliana, at least two whole genome duplications occurred at ≈40 million years ago (mya), and probably more than 200 mya (Blanc et al. 2003; Raes et al. 2003). In addition, plant genomes contain a high proportion of duplicated genes that arose by tandem duplications (AG Initiative 2000; Rizzon et al. 2006). Although the basic patterns of genome duplications at different evolutionary levels are now established, little is known about the role of natural selection in the subsequent fate of genes (Hahn 2009). Previous studies of gene duplications in A. thaliana, rice, and other plant species showed that positive selection can drive sequence divergence of both segmentally and tandemly duplicated genes. Examples include pollen-specific oleosins (Schein et al. 2004; Fiebig et al. 2004), genes involved in defense-related secondary metabolism (Benderoth et al. 2006; Mita et al. 2006), and disease resistance genes (Mondragón-Palomino et al. 2002; Kuang et al. 2004; Sun et al. 2006).

Functional analyses and expression patterns suggest that functional divergence of duplicated genes is common. In Arabidopsis, 57% of recent and 73% of older duplicates show divergent expression patterns (Blanc and Wolfe 2004a). Between 31.6 and 85% of pairs of paralogous Arabidopsis genes differ in their tissue-specific expression patterns (Duarte et al. 2006). However, neither functional divergence nor divergence in expression pattern is sufficient to distinguish between neo- and subfunctionalization unless the ancestral state of expression is known (Lynch and Conery 2000). When the ancestral state of expression was taken into account, only a few of the paralogous pairs had diverged in a way that was fully consistent with either a classic subfunctionalization or neofunctionalization model (Duarte et al. 2006). The majority of duplicated genes apparently underwent both neo- and subfunctionalization (He and Zhang 2005; Rastogi and Liberles 2005).

In the present study, we analyzed rates of sequence evolution to estimate the importance of selection in the divergence of pairs of paralogous genes in A. thaliana. Spillane et al. (2007) described the evolution of the imprinted MEDEA (MEA) gene, which originated by a recent genome duplication and acquired new functions during embryo development. Furthermore, it showed a strong signal of positive Darwinian selection during this period, whereas the sequence and function of its paralog SWINGER (SWN) remained highly conserved. SWN also showed a high level of genetic redundancy with its common ancestor gene, CURLY LEAF (CLF). These results suggested a neofunctionalization of the MEA gene but not of its paralog SWN. Here, we analyze how frequently a similar neo- and subfunctionalization of duplicated paralogs can be observed on a genome-wide level. We identified pairs of genes that were duplicated either as a result of the two whole genome duplications in the evolutionary history of A. thaliana or by tandem duplication. By using orthologous genes from the poplar genome as outgroups, we calculated lineage-specific rates of evolution and conducted tests of selection. We found that about 6.9% of A. thaliana paralogous gene pairs exhibit significantly different rates of sequence divergence between duplicated genes.

Materials and Methods

Sequence Data

The genome of A. thaliana was obtained from MIPS (ftp://ftpmips.gsf.de/cress) and TAIR (genomes release 6, ftp://ftp.arabidopsis.org). The Populus trichocarpa genome (version 1.1) was obtained from JGI (http://genome.jpi-psf.org/Poptr1_1/). The segmentally duplicated A. thaliana gene clusters as determined by Blanc et al. (2003) were downloaded from http://wolfe.gen.tcd.ie/athal/all_results. This dataset contains 3,044 pairs of genes in 91 distinct blocks, of which 3,041 genes were consistent with recent genome annotations. The Oryza sativa genome (release 5.0) was obtained from http://rice.tigr.org and the Saccharomyces cerevisiae genome from http://www.yeastgenome.org.

Determining Paralogous and Orthologous Relationships

We follow the established nomenclature for differentiating between orthologs and paralogs (Koonin 2005). Paralogs whose origin predates a speciation event are called outparalogs; they may be misidentified as orthologs if different paralogs are deleted in different lineages. Inparalogs originated after a speciation event and are specific to a particular lineage. To identify clusters of inparalogs in the A. thaliana and P. trichocarpa genomes, INPARANOID (version 2.0) was used (Remm et al. 2001), because this program performs well in ortholog classification with a sensitivity and specificity >80% (Chen et al. 2007a). The INPARANOID algorithm identifies inparalogs from two species using BLAST (Altschul et al. 1990) similarity scores between pairs of sequences. The two-way best hits of genes between species are considered as seed orthologs and form a cluster; potential inparalogs are then successively added to this seed pair. A BLAST-based clustering assumes equal evolutionary rates among paralogs (Li et al. 2003), but differential levels of selection or differences in the mutation rate among inparalogs may lead to unequal rates. Hence, we changed the default parameters of INPARANOID to allow the inclusion of more divergent inparalogs. The BLAST score cutoff was raised from 50 to 100, which reduces the number of pairwise comparisons used in the clustering step. We also lowered the confidence level for inclusion of inparalogs from 0.5 to −0.5, which increases the number of potential inparalogs for each cluster. The default setting of INPARANOID requires a positive confidence value for a gene to be accepted as an inparalog, but genes evolving under strong positive selection may violate this assumption if they are highly divergent from their paralogs.
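
The seed-ortholog step described above amounts to a reciprocal (two-way) best-hit search. The following is a minimal sketch of that idea only, not INPARANOID's implementation: it assumes that BLAST scores have already been parsed into dictionaries keyed by (query, subject), and the function names, gene identifiers, and score values are hypothetical.

```python
def best_hits(scores, min_score=100.0):
    """Return {query: best-scoring subject}, ignoring hits below min_score."""
    best = {}
    for (query, subject), score in scores.items():
        if score < min_score:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def seed_orthologs(a_vs_b, b_vs_a, min_score=100.0):
    """Pairs (a, b) for which a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, min_score)
    reverse = best_hits(b_vs_a, min_score)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical toy scores: "At_*" stand for A. thaliana genes, "Pt_*" for poplar genes.
at_vs_pt = {("At_a", "Pt_x"): 520.0, ("At_a", "Pt_y"): 310.0,
            ("At_b", "Pt_x"): 410.0, ("At_c", "Pt_z"): 80.0}
pt_vs_at = {("Pt_x", "At_a"): 515.0, ("Pt_x", "At_b"): 405.0,
            ("Pt_y", "At_a"): 300.0, ("Pt_z", "At_c"): 82.0}

pairs = seed_orthologs(at_vs_pt, pt_vs_at)
print(pairs)
```

In this toy input, At_b is not a seed ortholog because Pt_x scores higher against At_a; a gene of this kind is the sort of candidate that would subsequently be evaluated as an inparalog of the seed.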

After the INPARANOID run, only clusters with exactly two A. thaliana inparalogs were retained for further analysis. For the Arabidopsis inparalogs, we use the following nomenclature: the seed ortholog is denoted as At-1 and the added inparalog as At-2. It should be noted that clusters with >2 inparalogs can also be analyzed with appropriate models. Following Remm et al. (2001), we further eliminated all clusters whose BLAST scores were inconsistent with the species phylogeny when S. cerevisiae and O. sativa genomes were used as outgroups.

Refining and Aligning Orthologous Gene Clusters

Orthologous gene clusters identified by INPARANOID were compared with A. thaliana paralogs identified by Blanc et al. (2003) to extract only segmentally duplicated genes. Several paralogs appeared to be located in a segmentally duplicated gene cluster but were not included in the Blanc et al. (2003) dataset due to annotation inconsistencies (i.e., changes of the gene identifier code). To detect such genes, we used a sliding window technique for neighboring genes in segmentally duplicated regions. Groups of inparalogs, which were part of at least six neighboring genes in the same order on both clusters in a window of 20 genes were added to the set of genes from the Blanc et al. (2003) data. The sliding window analysis also identified genes that were present as tandem duplicates within one of two segmentally duplicated regions, but not in the other (Supplementary Figure S1) and they were also included in the analysis. Protein sequences of gene clusters were aligned with CLUSTAL (Higgins 1994) and corresponding gap-free codon-based alignments were generated with PAL2NAL (Suyama et al. 2006). The DNA sequence alignments were used to obtain the tree topology using DNAML from the PHYLIP package (Felsenstein 2005). We obtained 12,573 distinct clusters, of which 3,754 clusters contained >1 and 2,845 clusters exactly two A. thaliana inparalogs (Fig. 1). We also added 203 clusters with >2 A. thaliana inparalogs by considering only A. thaliana inparalogs with a positive confidence value. Among 3,048 INPARANOID clusters with exactly two A. thaliana inparalogs, 2,109 (70%) were identified as segmentally duplicated. Of these, 185 clusters could not be processed by PAL2NAL due to inconsistencies between DNA and protein data that mainly were observed among the poplar sequences and likely result from sequencing errors or wrongly predicted splicing sites; they were excluded from further analysis.
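
The conversion of a protein alignment into a gap-free codon alignment can be illustrated with a short sketch. This is a simplified stand-in for the idea behind the PAL2NAL step, not PAL2NAL itself: the function names and toy sequences are hypothetical, each coding sequence is assumed to contain exactly three nucleotides per residue without a stop codon, and the sketch does not verify that codons translate to the aligned residues.

```python
def thread_codons(aligned_protein, cds):
    """Map an ungapped coding sequence onto one gapped protein alignment row."""
    n_residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(cds) != 3 * n_residues:
        # Mirrors the kind of DNA/protein inconsistency that leads to exclusion.
        raise ValueError("CDS length does not match protein length")
    codons, pos = [], 0
    for ch in aligned_protein:
        if ch == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return codons


def gap_free_codon_alignment(aligned_proteins, cds_list):
    """Thread every CDS, then keep only codon columns without a gap in any row."""
    rows = [thread_codons(p, c) for p, c in zip(aligned_proteins, cds_list)]
    keep = [i for i in range(len(rows[0])) if all(r[i] != "---" for r in rows)]
    return ["".join(r[i] for i in keep) for r in rows]


# Hypothetical toy alignment of three short proteins and their coding sequences.
proteins = ["MK-L", "MKAL", "M-AL"]
cds = ["ATGAAACTG", "ATGAAGGCTCTT", "ATGGCACTA"]
aln = gap_free_codon_alignment(proteins, cds)
print(aln)
```

Only the first and last codon columns survive in this example, which shows how removing gapped columns shortens the alignment available for the codon models.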

Fig. 1

Cluster size distribution of inparalog clusters obtained from INPARANOID runs. Left panel size distributions of inparalog clusters from A. thaliana, poplar, and both species combined. Right panel size distribution of clusters included in the analysis

Tests of Sites Under Selection

The ratio of nonsynonymous substitutions per nonsynonymous site (d N) to the synonymous substitutions per synonymous sites (d S), ω = d N/d S, can be used as a test of natural selection (Yang and Bielawski 2000). Positive selection is inferred if ω > 1, purifying selection if ω < 1, and neutral evolution if ω = 1.

We used branch-site models (Forsberg and Christiansen 2003; Bielawski and Yang 2004) to infer ω ratios with the PAML package (Yang 1997). Clade model C was used to detect differences in the proportion of selected sites in the lineages between the two A. thaliana inparalogs. Note that, in contrast to branch-site model A of the PAML package, clade model C does not assume a fraction of sites with ω > 1. Tests for significant differences among models were calculated as likelihood ratio tests (LRTs), where the test statistic was 2Δl = 2 × (l 1 − l 2) with l 1 and l 2 as the log-likelihoods at the maximum likelihood (ML) estimates of the two models compared. It is assumed that 2Δl is approximately distributed as χ2 with the difference in the number of model parameters as degrees of freedom (d.f.), and critical values were obtained from this distribution. Model C estimates the proportion p 0 of codons with ω0 < 1 and a proportion p 1 of sites with ω1 = 1 for all branches combined and additionally a proportion p 2 of codons which are allowed to differ between the foreground (ω3) and background branches (ω2). We used an extension of clade model C as implemented in PAML version 4.4 which allows for two types of foreground branches (ω3 and ω4). For each cluster, an LRT was carried out in which the model with the two A. thaliana paralogs as separate foreground branches (ω3 ≠ ω4) was compared with the clade model C for which the two A. thaliana paralogs belonged to the same foreground branch (ω3 = ω4), assuming d.f. = 1 (Fig. 2). We also excluded those clusters for which tree length was larger than the number of branches of the phylogeny and those for which the posterior distributions for ω3 and ω4 significantly overlapped (>5%).
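
The LRT with d.f. = 1 described above can be sketched in a few lines. This is a minimal illustration and not part of the PAML workflow: the function name and the log-likelihood values are hypothetical, and the p-value relies on the identity that the upper tail of a χ2 distribution with one degree of freedom at x equals erfc(√(x/2)).

```python
import math


def lrt_df1(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models that differ by one parameter.

    lnl_null: maximised log-likelihood of the constrained model (omega3 = omega4)
    lnl_alt:  maximised log-likelihood of the general model (omega3 != omega4)
    Returns (2 * delta_l, p_value) using a chi-square distribution with 1 d.f.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    stat = max(stat, 0.0)  # guard against tiny negative values from optimisation noise
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for one cluster.
stat, p = lrt_df1(lnl_null=-4321.70, lnl_alt=-4318.20)
print(f"2*delta_l = {stat:.2f}, p = {p:.4f}")
```

With these assumed values the statistic is 7.0, which exceeds the 5% critical value of about 3.84 for one degree of freedom.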

Fig. 2

Outline of branch-site models used in the study. The phylogeny illustrates the most frequent case with two P. trichocarpa inparalogs (n = 2). The two branches leading to the A. thaliana inparalogs At-1 and At-2 of INPARANOID are labeled with ω3 and ω4, respectively. There are always two inparalogs in A. thaliana, but the number of inparalogs in poplar ranges from 1 to n (resulting in 2n + 1 branches). The duplication events are indicated by black circles. The branches leading to At-1 and At-2 are chosen as foreground branches. The proportions p 0 with ω0, p 1 with ω1, and p 2 of codons with ratio ω2, ω3, and ω4 are estimated. The proportion of codons under purifying selection (0 < ω0 < 1) and neutral codons (ω1 = 1) is estimated together for the whole phylogeny. At Arabidopsis thaliana, Pt Populus trichocarpa

Simulation Studies

The number of sequences in the alignment and the evolutionary distance (i.e., the average number of substitutions in a codon) strongly affect the power of an LRT to detect selection (Anisimova et al. 2001). Our samples are characterized by a low number of sequences and a high level of silent site degeneracy, which both reduce the power to detect lineage-specific evolutionary rates. Therefore, we conducted simulations with the evolver program of the PAML package for three sample clusters. Using the tree topology and the parameter estimates from the codeml branch-site analysis, two sets of 100 alignments each were generated (Table 4). For the first set (Simulation 1), data were simulated using the estimated values of the branch-site test with ω3 ≠ ω4. For the second set (Simulation 2), the estimated values of the branch-site test were used for the case ω3 = ω4. We determined the power and accuracy by conducting the branch-site test on the simulated sequences and counting how often LRTs were significant. If our approach is reasonable, the proportion of significant LRTs should be substantial for Simulation 1 and low for Simulation 2. However, an important factor is the quality of the alignments, as insertions and deletions may produce shorter and less accurate alignments (Fletcher and Yang 2010). Consequently, when gaps are removed from the alignment, a proportion of the remaining codons will be incorrectly aligned. These partly misaligned sequences could generate false positive or negative results when the branch-site test is applied. We were therefore interested in how insertions and deletions would alter the outcome of the branch-site tests on the simulated sequences. For this analysis, we constructed 100 alignments using the parameter values of the branch-site test with INDELIBLE (Fletcher and Yang 2009).
The length distribution of indels and the distribution of indels across the sequences are not known for A. thaliana and P. trichocarpa. We therefore used a scenario with equal rates of insertions and deletions and estimated the parameters as follows: The distribution of indels can be approximated by the Lavalette distribution (Fletcher and Yang 2009) for which the probability P of an indel of size u is given by

$$ P(u) = \left( {\frac{uM}{M - u + 1}} \right)^{ - a} $$
(1)

where u = 1, 2, …, M and M is the maximum indel size. It is not clear which parameter space is reasonable for a and M. However, from the estimate of the mean indel length it is possible to obtain a value for a for a given M. Empirical estimates range from 1.5 to 2 (Zhang and Gerstein 2003; Yamane et al. 2006; Cartwright 2009). We obtained values for a between 2.01 and 2.35 for M = 500 and between 1.61 and 1.93 for M = 200 (Supplementary Table S1). We show simulation results for M = 500 only, but results with other parameter values are very similar.
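
How a value of a can be obtained from a mean indel length for a given M can be sketched numerically. The sketch below treats the right-hand side of Eq. 1 as unnormalised weights and rescales them to sum to one, which is an assumption made here; the function names and the bisection bounds are hypothetical, and the code is not taken from INDELIBLE.

```python
def lavalette_pmf(a, M):
    """Normalised indel-length probabilities for u = 1..M (weights from Eq. 1)."""
    weights = [(u * M / (M - u + 1)) ** (-a) for u in range(1, M + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_indel_length(a, M):
    """Expected indel length under the normalised distribution."""
    pmf = lavalette_pmf(a, M)
    return sum(u * p for u, p in zip(range(1, M + 1), pmf))


def solve_a(target_mean, M, lo=0.5, hi=5.0, tol=1e-6):
    """Bisection for a; the mean length decreases monotonically as a increases.

    Assumes the target mean lies between the means obtained at lo and hi.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_indel_length(mid, M) > target_mean:
            lo = mid  # mean still too large, so a must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


# Round trip with an assumed value of a: compute the mean, then recover a from it.
m = mean_indel_length(1.8, 200)
a_hat = solve_a(m, 200)
print(round(a_hat, 3))
```

The round trip only demonstrates that the mapping between a and the mean indel length can be inverted for a fixed M; it does not reproduce the values in Supplementary Table S1.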

Analysis of Expression Profiles

Arabidopsis thaliana gene expression data were obtained from the Nottingham Arabidopsis Stock Centre microarray database (NASCArray). Hybridization experiments differ by the number of controls, and also the labeling procedures were not consistently standardized. Therefore, we used the raw expression values to calculate Kendall’s τ when comparing expression. Correlations could be calculated for 76% of all pairs (1,463 of 1,924) of inparalogs, because only a subset of A. thaliana genes were included on the microarrays. Owing to the heterogeneity of the type and conditions of experiments, the correlation coefficient should be considered as a rough estimate of co-expression (Table 1).
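
A rank correlation such as Kendall's τ depends only on the ordering of expression values, which is why it can be applied to raw values from heterogeneous experiments. The sketch below computes the τ-b variant (an assumption, since the variant is not specified above) with a simple quadratic-time loop; the function name and the expression values are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences of expression values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("tau is undefined when one variable is constant")
    return (concordant - discordant) / denom


# Hypothetical expression values of two inparalogs across five experiments.
gene_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_2 = [1.0, 3.0, 2.0, 5.0, 4.0]
tau = kendall_tau_b(gene_1, gene_2)
print(tau)
```

For genome-scale data a library routine with an O(n log n) algorithm would be preferable; the loop above is meant only to make the definition explicit.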

Table 1 Comparisons of mean correlations of co-expression between A. thaliana genes

Comparison of Inparalog Sequence Divergence with Intraspecific Sequence Variation

Resequencing data of 20 A. thaliana accessions obtained with the Perlegen array (Clark et al. 2007) were downloaded from TAIR. The site frequency spectrum was obtained for each locus. We estimated the proportion of adaptive substitutions, α, using an extension of the McDonald–Kreitman (MK) test (McDonald and Kreitman 1991) which takes into account the influence of slightly deleterious mutations (Eyre-Walker and Keightley 2009). Since many genes showed little polymorphism, we split each pair of inparalogs with significantly different ω ratios into two groups and summed data across genes. One group contained, for each pair, the inparalog with the higher of ω3 and ω4 (relaxed group), while the other group contained the remaining inparalogs (constrained group). Polymorphism data from 225 genes were available for comparison. Lineage-specific divergence was retrieved from the estimates of the free-ratio model from PAML.
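
The logic of summing polymorphism and divergence counts across genes before estimating α can be illustrated with the basic MK-type estimator, α = 1 − (D S × P N)/(D N × P S). This is deliberately the simple estimator and not the extension of Eyre-Walker and Keightley (2009) used in this study, which additionally models slightly deleterious mutations; the function names and all counts below are hypothetical.

```python
def alpha_mk(dn, ds, pn, ps):
    """Basic MK-type estimate of the proportion of adaptive substitutions.

    dn, ds: nonsynonymous and synonymous divergence counts
    pn, ps: nonsynonymous and synonymous polymorphism counts
    """
    if dn == 0 or ps == 0:
        raise ValueError("alpha is undefined when Dn or Ps is zero")
    return 1.0 - (ds * pn) / (dn * ps)


def pooled_alpha(genes):
    """Sum counts across genes first, then compute a single alpha for the group."""
    dn = sum(g["Dn"] for g in genes)
    ds = sum(g["Ds"] for g in genes)
    pn = sum(g["Pn"] for g in genes)
    ps = sum(g["Ps"] for g in genes)
    return alpha_mk(dn, ds, pn, ps)


# Hypothetical counts for two genes assigned to the same group.
group = [
    {"Dn": 30, "Ds": 60, "Pn": 4, "Ps": 12},
    {"Dn": 20, "Ds": 40, "Pn": 5, "Ps": 12},
]
alpha = pooled_alpha(group)
print(alpha)
```

Pooling before estimation avoids undefined per-gene ratios when a single gene has no synonymous polymorphism, which is the practical motivation for summing across genes with little polymorphism.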

Overrepresentation of Gene Ontology terms

Arabidopsis thaliana gene ontology (GO; Ashburner et al. 2000) annotations were obtained from the NASCArray. The GO term descriptors were retrieved from the gene ontology website (http://www.geneontology.org). We tested which GO terms are over- or underrepresented among pairs of inparalogs in comparison with all genes, and among inparalog pairs with significant LRTs in comparison with the remaining inparalog pairs. A hypergeometric distribution was assumed, which was approximated with a χ2 distribution for large numbers.
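
An over-representation test under the hypergeometric distribution can be written directly with exact integer arithmetic. The sketch below computes the upper-tail probability for a single GO term; the function name is hypothetical, and in the example only the totals of 1,924 and 132 pairs come from this study, whereas the counts of annotated pairs are invented for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n items are drawn without replacement from N items,
    of which K carry the GO term of interest."""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("inconsistent counts")
    top = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(max(k, 0), top + 1))
    return numerator / comb(N, n)


# 1,924 analysed pairs, 132 of them with a significant LRT (from the text);
# 100 annotated pairs overall and 15 among the significant ones are hypothetical.
p_over = hypergeom_upper_tail(k=15, K=100, n=132, N=1924)
print(p_over)
```

Under-representation can be tested analogously with the lower tail, P(X ≤ k) = 1 − P(X ≥ k + 1).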

Results

Extraction of Segmentally Duplicated Genes

To extract segmentally duplicated genes from the A. thaliana genome, we obtained pairs of inparalogs with INPARANOID and filtered them with the slightly expanded Blanc et al. (2003) dataset. In the end, 1,924 pairs of A. thaliana inparalogs were analyzed together with their poplar homologs using PAML (Fig. 1), of which 1,588 (82%) are also contained in the Blanc et al. (2003) data. More than 80% of the analyzed clusters consisted of at least four sequences, but 376 clusters consisted of three genes only. A majority of 1,774 clusters (92%) originated in the recent and 57 (3%) in the old duplication event; 93 gene clusters (5%) are tandemly duplicated genes. Among inparalog clusters, functional groups of genes are differentially represented (Supplementary Table S2). GO terms related to cellular locations and metabolic processes are underrepresented, whereas genes associated with the nucleus, DNA binding, and transcriptional activity are overrepresented.

Pairwise Comparisons of Homologs

The rate of evolutionary divergence between inparalogs and orthologs was estimated as the ratio of nonsynonymous (d N) to synonymous (d S) substitutions, \( \omega = d_{\rm{N}} /d_{\rm{S}} \), in pairwise comparisons of sequences (PAML runmode = −2; Fig. 3). A total of 1,862 (96.7%) A. thaliana inparalog pairs showed d S < 10 and ω < 20, which were used as cutoff values; the median ω value was 0.116. Among 8,712 Arabidopsis-poplar pairwise comparisons, 5,678 (65.2%) showed d S < 10 and ω < 20 and a median ω value of 0.0535. The d S value is assumed to represent the neutral mutation rate since synonymous codon positions are supposed to be largely free from selection. Our data agree with this hypothesis, as pairwise d N values are less variable than d S values. P. trichocarpa is a close relative of Medicago, and similar pairwise d S values in comparisons with A. thaliana were obtained for the Populus–Arabidopsis (2.2; Fig. 3) and Medicago–Arabidopsis (2.0–2.2; Blanc et al. 2003) comparisons. In contrast, the median d S value for the A. thaliana inparalog comparison is approximately 1.0, which reflects the high proportion of genes originating from the recent genome duplication. Consequently, the median ω value of the Arabidopsis-poplar orthologs is smaller than that of the Arabidopsis inparalogs because higher d S values decrease the ω ratio.

Fig. 3

Distribution of d S, d N, and ω values in pairwise comparisons of homologous regions. The mode values are indicated with dashed lines

Tests of Different Selection Pressures After Duplication

We used branch-site models (clade model C) to test for differences in selection pressure between A. thaliana paralogs (Fig. 2). We found that 493 of 1,924 gene clusters (25.6%) resulted in a significant LRT. However, for 299 clusters the posterior distributions of ω3 and ω4 values overlapped significantly, and these clusters were therefore excluded. Of the remaining 194 clusters, 62 had unreasonably high tree length estimates in at least one of the models and were excluded as well. The remaining 132 clusters (6.9%) showed significant differences in ω ratios between the two A. thaliana inparalogs. In 79 of these 132 clusters (59.8%), ω3 or ω4 was larger than one. Seed orthologs had smaller ω values than the second inparalog in 114 (86.3%) of 132 significant clusters. This number indicates that INPARANOID uses conserved members of a gene family to identify orthologs in other species, and then adds more divergent paralogs to a cluster. Among the 132 inparalog pairs with significant LRTs, five GO terms were overrepresented (Table 2). They include genes associated with nucleotide binding, protein amino acid phosphorylation, and response to stress.

Table 2 Overrepresented GO terms for gene pairs identified by the branch-site model

Identification of Selection-Driven Genes for Functional Analysis

One goal of this study was the identification of new candidate genes for further functional analyses. Since our study was motivated by the rapid evolution of MEDEA, which controls reproductive development and is likely involved in a genomic conflict, we were interested in genes with elevated ω. Fifteen inparalog pairs which have been identified by the branch-site model and showed ω > 1 for a substantial proportion of sites (n > 100) are shown in Table 3. These clusters contain genes that are involved in stress response (Mao et al. 2006; Sun et al. 2007; Kim et al. 2008), development (Bernhardt et al. 2010) and disease resistance (Kesarwani et al. 2007). Surprisingly, according to the TAIR literature database five out of those 15 gene pairs have yet to be functionally characterized.

Table 3 Examples of duplicated gene pairs with evidence for ω > 1 in one paralog lineage

Simulation Studies

For three sample clusters (Fig. 4), we checked the power to detect differences in the selective pressure using parameter estimates from the branch-site model (Table 4). We conducted simulations by applying LRTs to 100 simulated alignments for each of the three clusters. In 61–96% of cases, the LRTs were significant (Simulation 1). In contrast, in simulations with equal rates between the lineages (ω3 = ω4), only 7–13% of the LRTs were significant (Simulation 2). We also investigated how misaligned codons would alter the outcome of the branch-site test and modeled insertions and deletions in the simulated alignments using INDELIBLE (Table 4). In 41–91% of cases, the LRTs were significant (Simulation 3). In contrast, 9–14% of LRTs were significant when we modeled equal evolutionary rates between the lineages (Simulation 4). We therefore conclude that the given parameter values are reasonably well captured by the test statistic and that the impact of indels on the outcome of the branch-site test is limited.

Fig. 4

Phylogenetic trees of three INPARANOID clusters. a Cluster 745 including protein kinase At1g12460, b cluster 5758 including Rhomboid homolog protein 6 (At1g12750), and c cluster 6450 including carbohydrate binding At1g10150. Tree topologies were obtained with PHYLIP, and branch lengths (substitutions per codon) were calculated with CODEML using the nearly neutral model. Note that the trees are unrooted. ω for sites obtained from the branch-site model are indicated for both Arabidopsis inparalogs in bold

Table 4 Summary of simulation studies for three Inparanoid clusters

Estimating the Amount of Adaptive Substitutions

We estimated the proportion α of amino acid substitutions that underwent positive selection since the duplication event. For this, we used polymorphism data from a resequencing study of 20 A. thaliana accessions (Clark et al. 2007). A MK type of analysis (see “Materials and Methods” section) was used. Under the assumption that synonymous mutations are neutral, α can be estimated from simple expressions contrasting within-population polymorphism and corresponding levels of between-species divergence at two categories of sites (e.g., synonymous and nonsynonymous sites). Since we were interested in the amount of adaptive substitutions the 132 gene pairs with significantly different ω ratios underwent since the duplication event we used lineage-specific divergence data estimated from the free-ratio model of PAML.

An α value of 0.24 (0.11–0.35) was observed for relaxed inparalogs of the branch-site model which is significantly larger than 0 (Fig. 5a). In contrast, estimates for the constrained inparalogs are −0.33 (−0.59, −0.04) which supports the hypothesis that the relaxed inparalogs have undergone more adaptive evolution. We also estimated the distributions of fitness effects (DFE) for the two sets of genes (Fig. 5b). The results differ substantially between the two groups of genes. For the relaxed group, the proportion of neutral mutations (0 < N e s < 1) is decreased while the proportion of strongly deleterious mutations (N e s > 100) is increased. This indicates that these genes are subject to stronger purifying selection or have a higher effective population size. The DFE of the constrained inparalogs corresponds to a previous genome-wide estimate for A. thaliana (Gossmann et al. 2010) obtained from a different dataset (Nordborg et al. 2005).

Fig. 5

Estimates of α and the distribution of fitness effects for the identified gene pairs. a α, the proportion of fixed amino acid differences since the duplication event driven by positive selection, for pairs of A. thaliana inparalogs with significantly different ω3 and ω4 values. To conduct the test, each gene pair is split into a constrained and a relaxed inparalog depending on their ω3 and ω4 values from the branch-site model test. Polymorphism and divergence data are summed across genes. b Estimates of the distribution of fitness effects for the two groups (constrained and relaxed) of genes. Mutations are binned according to their fitness effects

Relationship Between Protein Sequence and Expression Pattern Divergence

Both the neo- and subfunctionalization models accommodate the functional divergence of paralogs in protein function or expression pattern. Genes whose sequences show evidence for positive selection or subfunctionalization after gene duplication may also evolve more rapidly in their expression pattern. We conducted t tests on correlation of co-expression (Table 1), using publicly available microarray experiment data. The average co-expression correlation is higher for pairs of inparalogs than for random pairs of genes (P = 4 × 10^−29). Within the set of inparalog pairs, co-expression is significantly reduced for gene pairs identified by the branch-site model (P = 3 × 10^−3).

Discussion

The publication of several plant genomes has identified tens of thousands of novel genes over the last decade and the discovery of new genes will continue with the advance of new sequencing technologies (Ellegren 2008). Given the much slower pace with which the function of newly discovered genes can be determined, the evolutionary and bioinformatic characterization of genes is an essential first step in describing genome structure and evolution because it can be automated to a large degree. In the present study, we were interested in estimating the proportion of genes evolving at different evolutionary constraints after gene duplication by calculating the rate of nonsynonymous to synonymous divergence since their origin by duplication. We hypothesized that paralogous genes with significantly different rates of sequence evolution became functionally divergent, because rate differences result either from adaptation to a new function, or from different levels of constraint (i.e., differences in level of purifying selection) after genes acquired new functions.

To investigate the proportion of neo- and subfunctionalized genes in A. thaliana, we constructed a largely automated analysis pipeline based on a whole genome comparison of A. thaliana and P. trichocarpa. In principle, paralogs in P. trichocarpa could be analyzed as well; however, since information about segmental clusters in P. trichocarpa was limited, we focused on A. thaliana genes. We generated a dataset of 1,924 pairs of duplicated A. thaliana genes (inparalogs) and their orthologs from the poplar genome with INPARANOID. Genes with annotation inconsistencies and unreasonably high d N, d S, and ω values were excluded because the quality of the alignment and of the reconstructed phylogenies is crucial for estimating correct ω values (Wong et al. 2008). As a preliminary analysis, we conducted branch tests to identify differences in selection pressure between the A. thaliana inparalogs (results not shown); only 3 of 1,924 clusters were significant after Bonferroni correction. However, a few amino acids may be sufficient for functional divergence of paralogs, and branch models have little power to detect selection (Anisimova et al. 2001; Studer et al. 2008), especially considering the large divergence since the last duplication event in Arabidopsis. To address this issue, we analyzed our dataset using a branch-site model and identified 132 gene pairs with different evolutionary rates between the two A. thaliana lineages.

Possible explanations for the severe discrepancies between the branch and branch-site tests are (i) a low power of the branch models to detect positive selection, (ii) different levels of purifying selection among paralogs, (iii) subfunctionalization rather than positive selection among paralogs detected with the branch model, and (iv) unreasonable parameter estimates of the branch-site model. To rule out the last possibility, we conducted simulations to estimate the power of PAML to detect positive selection with the branch-site model in alignments with few sequences (Table 4). They indicate that the applied branch-site test is sensitive enough to detect differences in the selective pressure between the lineages using the parameters estimated from the data. On the other hand, the rate of false positives is relatively low. We conclude that models rejected by the branch-site tests could be explained by subfunctionalization of one of the copies resulting in a fraction of sites evolving nearly neutrally. We also used simulations to investigate the impact of alignment errors caused by indels on the branch-site model. We observed only a slight reduction in the power to detect significant differences when the impact of indels on the alignment is taken into account. These results suggest that the branch-site test as implemented in our study is fairly robust against alignment errors.

According to GO term descriptors, genes with regulatory activities are enriched among inparalog pairs. After duplication, changes in the regulatory sequence or in the coding sequence can lead to neofunctionalization or subfunctionalization. Our analysis was restricted to coding sites. Genes associated with nucleotide binding and amino acid phosphorylation are overrepresented among gene pairs rejected by the branch-site model tests. The enrichment of regulatory genes is consistent with the hypothesis that changes in regulatory sequences may contribute to the amount of neo- and subfunctionalized genes even though they are not directly covered by our approach. This is also consistent with the observed co-expression pattern of inparalogs, as expression values showed a significantly reduced correlation in co-expression for gene pairs identified by the branch-site models. Such a reduction of correlation may be the consequence of a neofunctionalization of one of the copies or of a subfunctionalization of both copies. A previous attempt to distinguish the two possible scenarios by considering an inferred ancestral state of expression (Duarte et al. 2006) revealed that only a few gene pairs can be assigned to one of the two categories; instead, a mixture of both models may apply. The ω ratio integrates the selection pressure over a period of about 40 million years for paralogs originating from the recent duplication event (Blanc et al. 2003). To differentiate between the hypotheses that most divergence among inparalogs originated immediately after duplication, or that either one or both inparalogs are evolving rapidly until the present, sequences from additional species covering the phylogenetic distance since duplication are required. Then, variation of ω ratios in the phylogeny of each inparalog can be calculated with greater confidence.

Since genome sequences of close relatives are not yet available, an extension of the MK test was applied to compare lineage-specific divergence and polymorphism. Applying the MK analysis to paralogs may lead to an overestimate of the proportion of adaptive substitutions (α) because polymorphism data provide information only on recent positive or purifying selection but not on historical selection (Hahn 2009). Nevertheless, an estimate of α is still meaningful for two reasons. First, it provides an upper bound on the proportion of fixed adaptations. Second, a comparison of the distribution of fitness effects provides a comparable measure of the recent evolution of both duplicated genes. The comparison of two sets of inparalogs has the advantage that demographic history and breeding system, which both affect the site frequency spectrum and hence the estimate of α (Eyre-Walker 2006; Foxe et al. 2008), are identical for both groups of inparalogs and do not affect the inference of selection in different ways. To increase statistical power, we applied each MK analysis to a summed statistic for two groups of genes, assigning the genes of each inparalogous pair to either the constrained or the relaxed category according to the ω estimates of the branch-site model. Note that this might increase the estimate of α for the relaxed genes. However, differences in the α estimates are largely caused by differences in PN/PS (Fig. 5b). Divergence estimates do not differ significantly between the two groups of genes (P = 0.69 and P = 0.11 for dN and dS, respectively). Therefore, our estimates for genes with a significantly higher ω ratio for a fraction of sites indicate that up to 24% of the divergence since the duplication may be attributed to positive selection.
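
The dependence of α on PN/PS can be made concrete with a minimal sketch. It uses the classical MK-based estimator, α = 1 − (dS · PN) / (dN · PS), computed from counts summed over a group of genes. This is a simplification and not the extension of the MK test applied here; the function names and all counts are hypothetical.

```python
def summed_counts(genes):
    """Sum divergence and polymorphism counts over a group of genes."""
    keys = ("d_n", "d_s", "p_n", "p_s")
    return {key: sum(gene[key] for gene in genes) for key in keys}

def alpha_mk(d_n, d_s, p_n, p_s):
    """Classical MK-based estimate: alpha = 1 - (dS * PN) / (dN * PS)."""
    if d_n == 0 or p_s == 0:
        raise ValueError("dN and PS must be non-zero")
    return 1.0 - (d_s * p_n) / (d_n * p_s)

# Hypothetical counts for two genes of one category; NOT data from the study.
# d_n / d_s: nonsynonymous / synonymous fixed differences,
# p_n / p_s: nonsynonymous / synonymous polymorphisms.
relaxed_genes = [
    {"d_n": 30, "d_s": 40, "p_n": 6, "p_s": 10},
    {"d_n": 20, "d_s": 40, "p_n": 6, "p_s": 10},
]

totals = summed_counts(relaxed_genes)
alpha = alpha_mk(**totals)
```

With divergence counts held fixed, a lower PN/PS ratio raises α in this estimator, which is why groups with similar dN and dS can still differ in α.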

Conclusions

Our results imply that around 6.9% of the analyzed A. thaliana paralogous gene pairs show different rates of evolution after gene duplication. Asymmetry of selective pressure supports either increased positive selection or relaxation of purifying selection. Other mechanisms, such as preservation of duplicate genes by originalization, may also be important (Xue and Fu 2009; Tanaka et al. 2009). Our estimate might be a severe underestimate because we had to exclude a substantial number of sequences due to the high degeneracy of A. thaliana paralogs and the relatively highly divergent outgroup. Furthermore, functional differentiation by other mechanisms, such as alternative splicing or gene dosage effects, was not covered by our approach. Studer et al. (2008) have shown that positive selection has been pervasive during vertebrate evolution, but that whole genome duplication had no effect on the prevalence of positive selection. Estimates from direct tests for positive selection in yeast and Drosophila are even higher than ours (Conant and Wagner 2003), whereas estimates for Xenopus laevis are lower (Chain and Evans 2006). A recent study of young duplicates in the human, macaque, mouse, and rat genomes, using a branch-site test, revealed that about 10% of duplicated gene pairs evolved under positive selection (Han et al. 2009). Our result is the highest value reported so far for A. thaliana and shows that selection after duplication contributes substantially to gene novelties and hence functional divergence in plants.