Introduction

Gene duplication is one of the most important mechanisms in the evolution of gene diversity, presumably because it is easier to achieve new functions by modifying preexisting genetic systems than by generating them de novo (Ohno 1970; Go 1981; Gilbert 1978; Hughes 1994). After gene duplication, gene copies can explore three possible routes: (1) one paralog can lose the original function by the accumulation of deleterious mutations (nonfunctionalization); (2) one paralog may gain a new function under positive selection for advantageous mutations (neofunctionalization); or (3) original functions are partitioned among the two paralogs (subfunctionalization) (Force et al. 1999, Lynch and Conery 2000). Since the vast majority of mutations are deleterious, the usual fate for one duplicate is nonfunctionalization (Ohno 1970).

Until recently, the traditional view was that positive selection is the usual mechanism whereby a duplicated gene avoids nonfunctionalization. Positive selection leads to fixation of mutations conferring a new or modified function in one of the copies shortly after the duplication event (Ohta 1988). Interestingly, there is often an acceleration of the nonsynonymous substitution rate following gene duplication (Li 1997; Lynch and Conery 2000). After a new function has evolved, however, amino acid evolution is expected to be dominated by purifying selection and the rate of nonsynonymous substitution should decrease (Ohta 1993). Under an alternative model, the duplication–degeneration–complementation (DDC) model, duplicates are maintained because the functions are partitioned between paralogs which complement each other. This occurs as a result of degenerative mutations, which accumulate differentially in functional domains or in regulatory regions of genes (Force et al. 1999). It is the partitioning of subfunctions, rather than the acquisition of new functions, that preserves the duplicates; hence, this model does not explicitly require a role for positive Darwinian selection, although differential purifying selection may be at work. Recent empirical studies suggest that this model is applicable to some gene families (e.g., Gerhardt and Kirchner 1997; DiLeone et al. 1998; Force et al. 1999). However, the relative importance of these models remains a matter of controversy (Mazet and Shimeld 2002).

The eutherian mammal β-globin family comprises five functional genes (β-, δ-, ε-, Gγ-, and Aγ- globin) and one pseudogene (ψβ) typically arranged in a specific linkage order (Fig. 1). All the functional β-globin genes encode the β chain of hemoglobin, a tetramer composed of two α and two β chains, which binds oxygen noncovalently (Perutz 1983). The β-globin family constitutes a classic example of molecular evolution by gene duplication. Globin paralogs have explored diverse evolutionary pathways, with some functional genes retaining their original function (i.e., encode the β chain of adult hemoglobins) (Bunn 1981), others having become nonfunctional (Lacy and Maniatis 1980; Cleary et al. 1981; Li et al. 1981; Martin et al. 1983; Goodman et al. 1984), and yet others having changed their function and time of expression (Farace et al. 1984; Hutchinson et al. 1984; Fitch et al. 1991; Meireles et al. 1995; Johnson et al. 1996). Expression is partitioned among developmental stages, with β- and δ-globins expressed entirely in adults, ε-globin expressed solely in the embryo, and γ-globin expressed in the embryo in some placental mammals and in the fetus in simian primates (Hardison et al. 1997). Hence in the evolution of this gene family, both the partitioning of expression and the divergence of the proteins are important factors.

Figure 1

β-Globin gene linkage in different vertebrates (Cooper et al. 1996; Garner and Lingrel 1989; Konkel et al. 1979; Lacy et al. 1979; Kretschmer et al. 1981; Lingrel et al. 1983; Satoh et al. 1999; Schon et al. 1981; Shapiro et al. 1983; Townes et al. 1984; Schimenti and Duncan 1985b). Orientation is variable in fish and amphibian clusters (Gillemans et al. 2002; Hosbach et al. 1983).

In this paper, we investigate the pattern and process of evolution by gene duplication in the β-globin family. We assembled a dataset of 72 DNA sequences that include mammals, amphibians, fish, and birds. We inferred a phylogeny for the β-globin family and identified duplication events and gene conversions, some of which have not been reported previously. We were interested in testing some predictions of the neofunctionalization and subfunctionalization models in eutherian mammals, as their phylogenetic relationships are well studied and their sequences are not overly divergent. We wanted to contrast Ohta’s model with the DDC model, as they represent extremes in the debate on gene family evolution, although other models have been proposed (Ohno 1970, 1984; Patthy 1985; Gilbert 1978; Krakauer and Nowak 1999; Clark 1999; Kondrashov et al. 2002). Specifically we tested for (i) a significant increase in the rate of nonsynonymous substitution following gene duplication events, a consequence of neofunctionalization predicted by Ohta (1988); and (ii) significant differences in selective constraints among paralogs. Even though the DDC model is concerned with the evolution of regulatory regions, we hypothesize that, if subfunctionalization occurs in the protein-coding sequences as well as in the regulatory sequences, selective pressure should differ between paralogs. We measured selective pressure by using the nonsynonymous/synonymous substitution rate ratio (ω = dN/dS), as implemented in codon models of sequence evolution (Nielsen and Yang 1998; Yang et al. 2000). An ω < 1 indicates purifying selection, ω = 1 is consistent with neutral evolution, and ω > 1 indicates positive Darwinian selection (Yang and Bielawski 2000).

Materials and Methods

Sequence Data

For the phylogenetic study of the β-globin gene family, 72 sequences from various vertebrates including fish, amphibians, birds, and mammals were obtained from GenBank. The nomenclature of β-globin genes is rather chaotic. To avoid confusion, we have included species names and GenBank accession numbers next to each sequence in Fig. 2. We used the bony fish clade as the outgroup. Sequences were aligned using Clustal X (Thompson et al. 1997), followed by manual adjustments. Alignment gaps were removed.

Figure 2

Maximum likelihood tree of the β-globin gene family. GenBank accession numbers are provided after species names. Support values for nodes are bootstrap proportions. Duplication events are marked by circles. Asterisks show species used in the eutherian mammal dataset (see Fig. 3). Double-line branches indicate taxa misplacement relative to the expected species topology. LS—Low support (<40%).

Phylogenetic Analysis

We constructed phylogenies for the β-globin genes to understand the relative order of duplication and speciation events and to identify gene conversions. Trees were estimated from the nucleotide sequences using maximum parsimony, maximum likelihood, and Bayesian analysis. Relative support for internal branches was measured using bootstrap analysis with PAUP* (Swofford 1998). We performed the SH (Shimodaira and Hasegawa 1999), KH (Kishino and Hasegawa 1989), and RELL (Kishino and Hasegawa 1989) tests to compare the inferred gene tree with an alternative topology derived from the expected species relationships. We compared two trees each time: the tree in Fig. 2 and a tree modified by relocating the misplaced taxa according to the species phylogeny. We used the programs PLATO (Grassly and Holmes 1997), Pist (Worobey 2001), GENECONV (Sawyer 1999), Reticulate (Jakobsen and Eastel 1996), and Partimatrix (Jakobsen et al. 1997) and estimated the Homoplasy Index (Maynard Smith and Smith 1998) to test for nonreciprocal recombination between paralogous genes, i.e., gene conversion. Conflict between the estimated gene tree and the species tree determined which sequences were tested for gene conversion.

Analysis of Selective Pressure

To examine the selective pressure acting on genes from the β-globin family, we analyzed a dataset comprising β-, δ-, ε-, and γ-globin genes from eutherian mammals only. The sequences are identified in the tree in Fig. 3. The dataset included the 20 sequences marked with an asterisk in Fig. 2 plus the following: brown lemur β-globin gene (M15734), ε-globin gene (M15735), γ-globin gene (M155757), chimpanzee δ-globin gene (AF339363), and Aotus γ-globin gene (AF016985). Primates possess two copies of γ-globin; we chose to sample the Gγ copy because it is less likely to be affected by gene conversion, as gene conversion is almost exclusively unidirectional with Gγ converting Aγ (Fitch et al. 1990). β-globin genes converted δ-globin genes in some lineages (Koop et al. 1989); therefore we excluded the converted δ copies. Also excluded were some of the internally duplicated genes in the ruminant β-globin cluster (goat εIII, εIV, εV, and εVI and cow εI and εIII) (Fig. 1). These sequences are very divergent due to inserted sequences (Saban and King 1994). From the mouse β-globin cluster (Fig. 1), we sampled one of the three copies of fetal globin (βh1) and one of the two adult globin genes (β1). Separate datasets also were constructed for β-, ε-, and γ-globin genes. There were too few sequences available for a separate analysis of the δ-globin gene.

Figure 3

Maximum likelihood tree of the β-, δ-, ε-, and γ-globin genes from eutherian mammals. Gray circles indicate gene duplication events. Branches in the tree are partitioned into postduplication (PD; those that immediately postdate the gray circles) and postspeciation (PS; those that postdate species divergences) branches. The tree is rooted at the proto-ε- and proto-β-globin duplication event to help interpret classification of branches. All analyses were conducted using unrooted trees. Labels indicate the classification of PD (—) and PS branches (===, - - -, = = =, and · · ·). This is a general representation; different tests focused on different PD and PS branches according to each case described in the text.

Site-Based Analyses

A statistical approach was taken to study the selective pressure on the β-globin gene family in eutherian mammals. We used several codon models of molecular evolution that allow for heterogeneous dN/dS ratios at sites (Nielsen and Yang 1998; Yang et al. 2000). In the simplest model (M0 or one-ratio model), the ω ratio is an average over all the sites. The “neutral” model (M1) allows for conserved sites where ω = 0 and completely neutral sites where ω = 1. The “selection” model (M2) adds a third class to M1 at which ω can take values > 1. The discrete model (M3) uses an unconstrained discrete distribution with different ω ratios for K different classes of sites. Model M7 (beta) assumes a beta distribution of ω over sites. Model M8 (beta and ω) adds an extra class of sites to M7, thereby allowing ω values > 1. Likelihood ratio tests (LRTs) were conducted to test M0 (one-ratio) against M3, M1 (neutral) against M2 (selection), and M7 (beta) against M8 (beta and ω). All analyses were based on the unrooted gene-tree topologies and used the codeml program in the PAML package (Yang 1997).
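The likelihood ratio tests described above can be sketched in a few lines. This is an illustrative sketch rather than the procedure used in the study: the function names and the log-likelihood values are hypothetical, and the degrees of freedom assume K = 3 site classes for M3 (five parameters against one in M0) and the usual parameter counts for M1/M2 and M7/M8. The chi-square tail probability is computed with the standard library only.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with y = x / 2.
    """
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2:                      # odd df: start from Q(1/2, y)
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:                           # even df: start from Q(1, y)
        a, q = 1.0, math.exp(-y)
    while a < df / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def lrt(lnl_null, lnl_alt, df):
    """Return the statistic 2 * (lnL_alt - lnL_null) and its chi-square p-value."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(max(stat, 0.0), df)

# Assumed degrees of freedom for the three nested comparisons in the text.
SITE_MODEL_DF = {("M0", "M3"): 4, ("M1", "M2"): 2, ("M7", "M8"): 2}

# Hypothetical log-likelihoods, for illustration only (not values from the study).
stat, p_value = lrt(-2500.0, -2490.0, SITE_MODEL_DF[("M7", "M8")])
print(f"2 * delta lnL = {stat:.2f}, p = {p_value:.2e}")
```

A comparison is taken as significant when the p-value falls below the chosen threshold; the actual log-likelihoods come from the codeml output for each model.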

Branch-Based Analyses

To study changes in selective pressure in the context of gene duplication we implemented several models that allow for variable ω ratios among branches in the tree (Yang 1998; Bielawski and Yang 2003). The null model assumed the same ω for all lineages in the tree. The “PD–PS” model assigns different ω ratios for postspeciation and postduplication branches in the tree (e.g., Fig. 3). This is based on the hypothesis that duplicated genes avoid nonfunctionalization because positive Darwinian selection promoted fixation of amino acid mutations that led to a new or modified gene function (Ohta 1988). The hypothesis predicts a burst of amino acid replacements in the branches postdating duplication events (Ohta 1983). After a new function evolves, however, amino acid evolution is expected to be dominated by purifying selection and the rate of nonsynonymous substitution should decrease (Ohta 1993). Hence there should be a higher rate of amino acid substitution along branches that immediately postdate duplication events (PD branches) compared with those branches that immediately postdate speciation events (PS branches). An LRT can be conducted to compare the one-ratio model (ωPD = ωPS) with the two-ratio model PD–PS (ωPD ≠ ωPS).
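To make the PD–PS partition concrete, the sketch below writes a small tree in Newick form with postduplication branches marked by the `#1` branch label that PAML reads for branch models; unlabeled branches fall into the background class. The tree, the sequence names, and the `to_newick` helper are hypothetical and serve only to illustrate how the two branch classes might be specified.

```python
def to_newick(node):
    """Serialize a (content, label) node into Newick text with PAML-style labels.

    `content` is either a tip name (str) or a list of child nodes; `label` is
    an integer branch class, where 0 means background (left unlabeled).
    """
    content, label = node
    if isinstance(content, str):
        text = content
    else:
        text = "(" + ",".join(to_newick(child) for child in content) + ")"
    return text + (f" #{label}" if label else "")

# Toy tree: the stem branch of each paralogous clade postdates a duplication
# (class 1, PD); every other branch postdates a speciation (class 0, PS).
toy_tree = (
    [
        ([("beta_human", 0), ("beta_mouse", 0)], 1),
        ([("epsilon_human", 0), ("epsilon_mouse", 0)], 1),
        ("outgroup", 0),
    ],
    0,
)

labeled = to_newick(toy_tree) + ";"
print(labeled)
```

Under the one-ratio null model the labels are simply ignored, so the same tree file can serve both models in the comparison.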

Another alternative model was based on the hypothesis that duplicated genes avoid nonfunctionalization because expression patterns and/or functions are partitioned among paralogs following gene duplication (Force et al. 1999). If subfunctionalization had indeed occurred in the protein-coding sequences, sites associated with such partitioning are expected to exhibit long-term differences in selection pressure. If the difference between paralogs is large, we might be able to detect paralog-specific differences in average selective constraint. We formalized this in a model called “Paralog,” where an independent ω ratio is specified for each paralogous clade (e.g., ωβ ≠ ωγ ≠ ωε). To test for a significant difference in selective pressure among paralogs we conducted an LRT comparing the one-ratio model (e.g., ωβγε) with the three-ratio Paralog model.

Branch-Site Analysis

The above approaches might not detect a short episode of positive Darwinian selection, such as immediately following a gene duplication event, if it occurs at just a fraction of amino acid sites. The recently developed “branch-site” models (models A and B) allow the ω ratio to vary both among lineages and among sites and permit detection of lineage-specific changes in selective pressure at specific amino acid sites (Yang and Nielsen 2002). Branch-site models A and B have four ω site classes. The first two site classes, with ω0 and ω1, are uniform across the phylogeny, whereas the other two site classes are allowed to change from ω0 → ω2 and from ω1 → ω2 in a prespecified branch of interest (the “foreground” branch). Note that ω2 can take values > 1, thus allowing for positive selection. In branch-site model A, ω0 is fixed to 0 and ω1 is fixed to 1; hence positive selection is permitted at only the foreground branch. Model A is compared with model M1 (neutral) with degrees of freedom (df) = 2. In model B, ω0 and ω1 are free parameters; therefore some sites can evolve under positive selection across all the branches in the phylogeny, whereas other sites are permitted to take ω values > 1 in the foreground branch. An LRT compares model B with model M3 (discrete) with K = 2 site classes and df = 2. We used branch-site models A and B to test for possible adaptive evolution along lineages following gene duplications.
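The four site classes can be laid out explicitly. The sketch below follows our reading of the parameterization in Yang and Nielsen (2002), in which the two foreground-shifted classes split the remaining proportion 1 − p0 − p1 in the ratio p0 : p1; the function name and the numerical values are hypothetical and are not estimates from this study.

```python
def branch_site_classes(p0, p1, omega0, omega1, omega2):
    """Return the four site classes of a branch-site model.

    Each class has a proportion, a background omega, and a foreground omega.
    Classes 0 and 1 are uniform across the tree; classes 2 and 3 switch to
    omega2 on the foreground branch only.
    """
    if not (p0 > 0 and p1 > 0 and p0 + p1 < 1):
        raise ValueError("need p0 > 0, p1 > 0 and p0 + p1 < 1")
    rest = 1.0 - p0 - p1
    return [
        {"proportion": p0, "background": omega0, "foreground": omega0},
        {"proportion": p1, "background": omega1, "foreground": omega1},
        {"proportion": rest * p0 / (p0 + p1), "background": omega0, "foreground": omega2},
        {"proportion": rest * p1 / (p0 + p1), "background": omega1, "foreground": omega2},
    ]

# Model A fixes omega0 = 0 and omega1 = 1; p0, p1 and omega2 are illustrative values.
model_a = branch_site_classes(p0=0.6, p1=0.3, omega0=0.0, omega1=1.0, omega2=4.0)
for index, site_class in enumerate(model_a):
    print(index, site_class)
```

For model B the same layout applies, with ω0 and ω1 estimated from the data instead of fixed.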

Results

Phylogenetic Analysis

The 72 β-globin family genes in Fig. 2 were used for phylogenetic reconstruction. The ML tree is shown in Fig. 2. Both ML and Bayesian methods resulted in similar topologies, with support values for the internal nodes shown in Fig. 2. The only case of disagreement between the two methods was in the placement of marsupial and monotreme sequences. In the Bayesian tree the echidna β-globin gene was sister to a marsupial clade (opossum and dunnart β-globins), and in turn this clade was placed sister to the eutherian β-globin clade. In the ML tree (Fig. 2), the echidna β-globin gene was sister to the eutherian β-globin clade. Clearly, placement of the monotreme and marsupial β-globins is problematic and will probably require additional sampling to resolve. Interestingly, the marsupial ω-globin genes were placed outside the mammalian β-globin clade, consistent with the earlier study of Wheeler et al. (2001).

Assuming no gene conversion, we expected (i) monophyly for each set of paralogs (i.e., β-, δ-, ε-, and γ-globins) and (ii) to recover the expected species tree within each paralogous clade (Rowe 1999; O’Brien et al. 2001; Springer et al. 2003). However, we found some notable misplacements (double lines in Fig. 2): (i) the rabbit ε and γ sequences were sister to the primate ε and γ genes, respectively, rather than sister to rodent ε- and γ-globins; (ii) the cow εII and εIV genes and goat εII comprised a monophyletic clade sister to the γ-globins instead of being within the ε clade; (iii) the mouse ε gene (a single-copy gene traditionally called y) did not appear within the ε clade but was sister to a clade including the cow εII and εIV and the goat εII genes; (iv) tarsier and bushbaby δ-globin genes were sister to tarsier and bushbaby β-globin genes, respectively; (v) the genes traditionally labeled as γ-globins in sheep, cow, and goat were placed within the β-globin clade; (vi) chicken ε was sister to chicken ρ instead of being more closely related to duck ε-globin; and (vii) Cebus Gγ and Aγ were more closely related to each other than to their respective human and chimpanzee orthologs.

All misplacements were supported by high bootstrap proportions (>70%) with the exception of the Cebus Aγ and Gγ, the rabbit ε, and the mouse y branches, where there was low bootstrap support. We used the SH test to compare the expected placements with the estimated topology (Fig. 2). SH tests indicated significantly greater support for five misplacements (bushbaby δ, p < 0.0001; tarsier δ, p = 0.002; cow, sheep, and goat γ, p = 0.000; echidna β, p = 0.053; and Cebus Aγ and Gγ, p = 0.000). The remaining misplacements did not fit these data significantly better than the expected phylogenetic placements (mouse ε, p = 0.095; rabbit γ, p = 0.217; rabbit β, p = 0.59; cow ε, p = 0.193; rabbit ε, p = 0.289; and mouse y, goat εII, and cow εII and εIV, p = 0.225; chicken ρ, p = 0.397). Results under KH and RELL tests were the same as with SH tests (data not shown).

A potential source of conflict between the gene tree and the species tree could be gene conversion (Ohta 1980, 1990). Hence, we used the misplacements to guide our tests of gene conversion. Tests were conducted on alignments of third codon positions only, by using different software programs (Grassly and Holmes 1997; Worobey 2001; Sawyer 1999; Jakobsen and Eastel 1996; Jakobsen et al. 1997; Maynard Smith and Smith 1998). We found evidence for two gene conversion events that have not been reported previously: (i) among the duplicates in the goat β-globin cluster between nucleotide 12 and nucleotide 75 (site numbering refers to the human β-globin gene; PDB file 2hhb) (PLATO z-score = 4.85) and (ii) among the mouse β-globin cluster genes between nucleotide 210 and nucleotide 235 (PLATO z-score = 3.87). Our analysis corroborated gene conversions previously suggested for tarsier and bushbaby δ-globin genes (Koop et al. 1989; Grassly and Holmes 1997) between nucleotides 45–63 and nucleotides 357–375, and in cow εII and cow εIV between nucleotide 12 and nucleotide 30, in agreement with Schimenti and Duncan (1985a). However, we found no evidence for gene conversions between mouse β genes βh0 and βh1 or between mouse βh0 and mouse y (see Figs. 1 and 2), reported by Hill et al. (1984), or between Cebus Gγ and Aγ.

Variable Selective Pressure Among Sites

In order to minimize the effect of gene conversion, we excluded the converted genes. Given that gene conversion tends to have a direction in globins, we knew for instance that δ-globins are generally converted by β-globin, and not vice versa. This prior knowledge allowed us to minimize gene conversion effects to some extent, although eliminating conversion altogether is impossible, as numerous events have characterized the evolution of β-globin genes. We also compared tests of variable selective pressure using different datasets, both with and without misplaced sequences. We obtained similar results for the different datasets, confirming that gene conversion, although probably present, did not greatly affect our results.

We expected selective pressure to vary among sites and among the genes of the β-globin family. We used codon models to detect among-site variability in selective pressure in the β-, ε-, and γ-globin genes. From the one-ratio model (M0) we found that the ω ratio averaged over all sites is 0.27, 0.26, and 0.17 for β-, γ-, and ε-globin genes, respectively, when the three genes were analyzed as separate data sets (Table 1). The estimates suggested that, on average, the ε-globin is more constrained than the γ and β. However, an ω ratio averaged over sites is a crude measure of selective pressure. Therefore we used models that allow selective pressure to vary among sites. The discrete model (M3), with three site classes, revealed considerable variation in selective pressure among sites (Table 1). For example, β-globin had 65% of sites under strong purifying selection (ω = 0.02), 26% of sites were less constrained (ω = 0.57), and 9% of sites were under positive selection (ω = 2.02) (Table 1). Interestingly, neither γ nor ε showed evidence of sites evolving under positive selection (Table 1). Evolution of the majority of sites in all three paralogs was dominated by strong purifying selection, with 65% of sites in β, 52% of sites in γ, and 66% of sites in ε evolving with ω < 0.05.
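As a worked illustration of why a single averaged ω is a crude summary, the short sketch below computes the proportion-weighted mean ω implied by the three M3 site classes reported above for β-globin. The variable names are ours. The weighted mean need not equal the M0 estimate, because M0 and M3 are fitted separately.

```python
# M3 site classes for beta-globin as reported in the text: (proportion, omega).
beta_m3_classes = [(0.65, 0.02), (0.26, 0.57), (0.09, 2.02)]

total_proportion = sum(p for p, _ in beta_m3_classes)
weighted_mean_omega = sum(p * w for p, w in beta_m3_classes)

print(f"sum of proportions = {total_proportion:.2f}")
print(f"weighted mean omega = {weighted_mean_omega:.3f}")
```

A mean well below 1 can therefore coexist with a minority class of sites at ω > 1, which is what the site models are designed to reveal.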

Table 1 Parameter estimates and likelihood scores in separate analyses of the β-, γ-, and ε-globin genes under site-specific models

We tested for variable selective pressure among sites by conducting an LRT comparing the one-ratio model (M0) with the discrete model (M3); results were highly significant for all three genes (Table 2). In general, β-globin was the most variable gene in the family, having an additional class of sites evolving under positive Darwinian selection.

Table 2 Likelihood ratio test statistics for comparing site-specific models for the β-, γ-, and ε-globin genes

We were interested in identifying regions that are conserved in all three genes in the cluster, which presumably indicate functionally important residues in the protein product. For β-, γ-, and ε-globin genes separately, we plotted the approximate posterior mean of the ω ratio at each site (Fig. 4). Four regions are highly conserved in all three genes: (i) residues 28 to 38, located in helices B and C; (ii) residues 57 to 63, located in helix E; (iii) residues 79 to 81, located in helix F; and (iv) residues 87 to 101, located in helices F and G. When mapped onto the three-dimensional structure of the β-chain in hemoglobin, we found that sites within these four constrained regions were located mostly on the inner hydrophobic core of the subunit, the area around the heme pocket, and the α1β1 interface. In all cases we used the human β-globin chain structure (PDB: 2hhb) as reference to map sites into the three-dimensional structure. Residues 28 to 38 are distributed among the hydrophobic core, the α1β1 interface between monomers, and part of the heme pocket.
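The posterior mean of ω at a site can be sketched as a direct application of Bayes' rule over the site classes: the class proportions act as prior weights and are updated by the likelihood of the site's data under each class. In the sketch below the class proportions and ω values are the β-globin M3 estimates quoted earlier, whereas the per-class site likelihoods and the function name are hypothetical; in practice those likelihoods come from the codon model and the tree.

```python
def posterior_mean_omega(proportions, omegas, site_likelihoods):
    """Return (posterior class probabilities, posterior mean omega) for one site.

    posterior_k is proportional to proportion_k * likelihood of the site
    under class k; the mean weights each class omega by its posterior.
    """
    weights = [p * f for p, f in zip(proportions, site_likelihoods)]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    mean_omega = sum(q * w for q, w in zip(posteriors, omegas))
    return posteriors, mean_omega

proportions = [0.65, 0.26, 0.09]   # M3 estimates for beta-globin (from the text)
omegas = [0.02, 0.57, 2.02]

# Hypothetical likelihoods of one site's data under each class.
conserved_site = [0.02, 0.01, 0.001]
posteriors, mean_omega = posterior_mean_omega(proportions, omegas, conserved_site)
print([round(q, 3) for q in posteriors], round(mean_omega, 3))
```

A site whose data are best explained by the conserved class receives a posterior mean close to that class's ω, which is how conserved stretches appear as low regions in a per-site plot.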

Figure 4

Approximate posterior mean of the ω ratio for each site calculated under model M3 (discrete) for the (a) β-globin, (b) ε-globin, (c) γ-globin genes.

The site-specific codon models were also used to identify positive selection at sites, indicated by ω > 1. The selection model (M2), the discrete model (M3), and the beta and ω model (M8) allow ω > 1 at a fraction of sites (Yang et al. 2000). All three models were generally consistent in suggesting a small fraction of sites (4 to 9%) evolving under positive Darwinian selection (ω between 2.02 and 3.58) in the β-globin gene (Table 1). We tested significance of sites evolving under positive selection by an LRT comparing M7, which does not allow for such sites, with M8, which has an additional parameter that can accommodate sites with ω > 1. The test is highly significant for the β-globin gene (Table 2).

Variable Selective Pressures Among Branches

A burst of nonsynonymous evolution is often observed following gene duplication, and positive Darwinian selection is frequently invoked to explain this pattern. An LRT was used to test whether selective pressure is significantly different between postduplication (PD) and postspeciation (PS) branches in the β-globin gene phylogeny; i.e., (ωβ(PD) = ωε(PD) = ωγ(PD)) ≠ (ωβ(PS) = ωε(PS) = ωγ(PS)). The LRT was not significant (Table 3), suggesting no difference between PD branches and PS branches. Furthermore, estimates of ω suggested strong purifying selection in both the PD and the PS branches (ω(PD) = 0.34, ω(PS) = 0.23). We also fitted a more general four-ratio model in which the branches postdating the three duplication events in the phylogeny were assigned independent ω ratios (ωβ(PD), ωε(PD), ωγ(PD), ω(PS)) and compared it with the one-ratio model. Again, the LRT was not significant (Table 3), and none of the parameter estimates suggested positive Darwinian selection: ωβ(PD) = 0.41, ωε(PD) = 0.22, ωγ(PD) = 0.08, ω(PS) = 0.24. Note that in both PD–PS models tested, dN values averaged 0.024 and dS values averaged 0.101.
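As a rough consistency check on these numbers, the ratio of the averaged dN and dS values can be compared with the reported ω estimates. This is only a back-of-the-envelope sketch: the ratio of averages is not the maximum likelihood estimate of ω, a point estimate alone says nothing about significance, and the `classify_omega` helper and its tolerance are our own illustrative choices.

```python
def classify_omega(omega, tol=1e-6):
    """Label a point estimate of omega by the regime it is consistent with."""
    if omega < 1.0 - tol:
        return "purifying"
    if omega > 1.0 + tol:
        return "positive"
    return "neutral"

mean_dn, mean_ds = 0.024, 0.101      # averages reported in the text
rough_omega = mean_dn / mean_ds

print(f"dN/dS from averages = {rough_omega:.3f} ({classify_omega(rough_omega)})")
for label, omega in [("PD", 0.34), ("PS", 0.23)]:
    print(label, omega, classify_omega(omega))
```

The ratio falls close to the PS estimate, in line with the conclusion that both branch classes are dominated by purifying selection.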

Table 3 Maximum likelihood estimates of ω ratios under branch-specific models and likelihood ratio test statistics when the model is compared with the null model M0 (one ratio)

The above analysis averages rates over all sites in the gene and may lack power in detecting positive selection. Thus we also used branch-site models A and B (Yang and Nielsen 2002) to detect positive selection at a subset of sites along specific lineages. We tested each postduplication branch in the β-globin phylogeny as defined in Fig. 3. We found no evidence for positive selection at branches immediately following the duplication event that gave rise to proto-β and proto-ε, or after the duplication that created ε and γ (data not shown). The duplication event that resulted in Aγ- and Gγ-globins is hypothesized to have occurred along the branch leading to the simian primates (Slightom et al. 1985), but cannot be resolved on a gene tree because of frequent gene conversion events. When we used a specific dataset comprising ε- and γ-globins (Fig. 5) and tested the branch where the duplication is thought to have occurred, we found an increase in nonsynonymous substitutions (M1 vs MA, 2δ = 37.16, df = 2, p < 0.0001; M3 vs MB, 2δ = 18.66, df = 2, p < 0.0001). The dN value was 0.021 and the dS value was 0.039, measured as an average over all branches of the ε- and γ-globin tree. Parameter estimates under models A and B suggested positive selection at a few sites along the branch leading to simian primates (ω2(MA) = 10.0, ω2(MB) = 4.58 in Table 4). Interestingly, this branch is also thought to coincide with the recruitment of γ-globins for fetal expression (double line in Fig. 5).
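The reported p-values can be checked directly from the two test statistics. For a chi-square distribution with df = 2 the upper-tail probability has the closed form exp(−x/2), so no statistical library is needed. The sketch assumes the chi-square approximation with df = 2 stated above; the function and dictionary names are ours.

```python
import math

def chi2_sf_df2(statistic):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)

# Test statistics reported in the text for the stem-simian branch.
reported_statistics = {"M1 vs MA": 37.16, "M3 vs MB": 18.66}

p_values = {name: chi2_sf_df2(value) for name, value in reported_statistics.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.2e}")
```

Both tail probabilities fall below 0.0001, in agreement with the thresholds quoted for the two comparisons.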

Table 4 Parameter estimates and log-likelihood scores for the γ-globin gene under different sites and branch-site models
Figure 5

Maximum likelihood tree of the ε- and γ-globin genes from eutherian mammals. The double line corresponds to the branch where the Gγ and Aγ split is hypothesized to have occurred, in the ancestor of simian primates.

Globin genes are expressed at different developmental stages, so each gene might be subject to different selective pressures. To test for paralog-specific differences in selective pressure, we fitted the “Paralog” model, where β-, γ-, and ε-globins have independent selective pressures (i.e., ωβ ≠ ωε ≠ ωγ). This model fits the data significantly better than the one-ratio model, with parameter estimates ωβ = 0.29, ωε = 0.16, ωγ = 0.23 (Table 3). The average dN value was 0.024 and the average dS value was 0.103. Those estimates are consistent with the ω estimates from the separate analysis of the paralogs, with ε-globin more constrained than γ- and β-globins (Table 1). Fitting additional models with two of the three ratios (ωβ, ωε, ωγ) forced to be identical suggests that ωγ is different from ωβ and ωε, while ωβ and ωε are not significantly different (Table 3).

Discussion

Gene conversion plays an important role in the evolution of multigene families, as it brings about the exchange of genetic material between related sequences (Schimenti 1994; Posada et al. 2002). It is a frequent mechanism of evolutionary change in globins and can act both to homogenize genes through concerted evolution (e.g., Aγ and Gγ in simian primates) and to introduce novelty among homologous genes (e.g., cow εII and εIV). Gene conversion is known to affect gene phylogenies (Slatkin and Maddison 1989; Hudson et al. 1992; Maddison 2000). Given the general importance of the mechanism, its pervasiveness, and its effects on phylogeny reconstruction, it is essential to test for gene conversion when topological discrepancies arise in a gene family tree (Drouin 2002). By using statistical methods, we found evidence of two unreported gene conversion events in β-globins—(i) among duplicates in the goat β-globin cluster and (ii) among duplicates in the mouse β-globin cluster—and we confirmed many previously suggested cases. Furthermore, we suggest that the majority of misplacements in our gene tree are the result of gene conversion events.

The traditional model of evolution by gene duplication predicts an increase in nonsynonymous substitution rate immediately after genes duplicate. It is a matter of debate whether this rate increase is due to a relaxation of selective pressure or to the action of positive selection for advantageous mutations (for a review see Massingham et al. 2001; Mazet and Shimeld 2002). Previous studies of the β-globin family supported the positive selection model, with this mode of evolution being suggested following the split of myoglobin and hemoglobin (Goodman 1981) and following the divergence of α- and β-hemoglobins (Czelusniak et al. 1982). Accelerated amino acid evolution also occurred after the en bloc duplications within the ruminant artiodactyl lineage (Li and Gojobori 1983). In contrast to these examples, we found no significant evidence for a burst of nonsynonymous evolution in the branches postdating the initial duplications of the proto-β and proto-ε genes, or after the duplication giving rise to the β and δ or to the ε and γ clades, which correspond to the major duplication events within the gene family. We also tested for an increase in nonsynonymous substitutions at particular sites along the postduplication branches using branch-site models but failed to detect an evolutionary burst. Interestingly, a recent study of the early stages of evolution of duplicate genes within the human genome found that most genes exhibit an accelerated rate of nonsynonymous evolution in one duplicate (Zhang et al. 2003). Our data suggest that the early divergences within the β-globin family of genes do not fit this pattern, as we found no such changes in the evolutionary rate during the early stages of divergence.

There was one exception to the general pattern described above. In the lineage of stem-simians, which represents the transition from embryonic to fetal expression of γ-globins (Tagle et al. 1988; Fitch et al. 1991), we detected an acceleration in nonsynonymous substitution rates and identified positively selected sites. Whereas previously used methods employed raw counts of synonymous and nonsynonymous substitutions, and were thus unable to determine the source of amino acid evolution acceleration, the branch-site models indicated that nonsynonymous rate acceleration in the lineage of stem-simian γ-globins was caused by positive Darwinian selection.

Although it is possible that undetected gene conversion affects our tests for variable dN/dS rate ratios among branches, we believe that our results are not overly influenced by it. Our estimates are based on the comparison of silent and replacement changes and both are similarly affected by gene conversion events. Furthermore, a recent simulation study (Anisimova et al. 2003) showed that LRTs are robust to low or moderate levels of recombination, such as those we might not have been able to detect. It could also be that greater similarity among sequences reduced the power of our tests to detect an increase in dN/dS rate ratios following gene duplication. However, we note that the tests were powerful enough in the case of the simian γ-globin amino acid replacement acceleration. Furthermore, if adaptive evolution occurs by a single or a small number of substitutions, it may not be detected by methods based on dN/dS ratios (Bielawski and Yang 2003). It is known that large phenotypic changes in globins can be achieved by only one or a few amino acid changes (Perutz 1983). A good example of the latter is provided by the deletion of the NA1 valine residue from the protein chain encoded by γ-globin in some artiodactyls, which increases the oxygen affinity of the hemoglobin monomer (Poyart et al. 1992). Hence, in cases where we did not detect positive selection or even an increase in amino acid replacement rates, our findings do not exclude the possibility of neofunctionalization in β-globin genes by a few adaptive substitutions with large phenotypic effects.

The DDC model of gene copy preservation does not require a burst of nonsynonymous substitutions and assumes that purifying selection continues to act on both gene copies following duplication (Force et al. 1999; Zhang 2003). Nonetheless, if subfunctions are partitioned among the functional domains of the encoded protein, a potential outcome of the DDC model is heterogeneity in purifying selection among the gene copies. Dermitzakis and Clark (2001) proposed that identification of heterogeneity in the pattern of amino acid substitution between different domains of the proteins encoded by paralogous genes could lead to the discovery of genes undergoing subfunctionalization. While the DDC model has traditionally centered on regulatory sequences, we extend the possibility of finding subfunctionalization to protein-coding sequences by identifying heterogeneous selection pressure among paralogs. In the case of mammalian β-globins, genes are linked in a specific arrangement which, in most species, is known to be related to the order of expression of the genes (Hardison 1998). If the arrangement of β-globin genes in the cluster corresponds to a domain-like partition of function, each domain of expression could be subject to different selective pressures. Hence, our results are in agreement with a subfunctionalization model, as we found that each paralogous clade (i.e., domain of expression) is subject to significantly different selective constraints. Our findings suggest a long-term process of divergence during which each paralog has been subject to different constraints by purifying selection, presumably related to differences in expression regulation. As described earlier, our findings do not exclude the possibility of brief episodes of increased amino acid replacement, in which case other models (e.g., Ohta 1988) may still be relevant to the evolution of β-globins.

The β-globin gene is the only gene with sites predicted to be under positive selection in placental mammals. We identified 12 sites under Darwinian selection, consistent with the earlier study of Yang et al. (2001). These sites are located mostly at the exterior of the protein chain, with two sites located at the α1β1 interface between the α and β subunits of hemoglobin (116H and 111A). As a prelude to a more detailed analysis, we tested for positive selection in the α-globin genes currently available in GenBank and found at least one positively selected site (115A) located at the α1β1 interface. Our results raise the interesting possibility of long-term coevolution of some α- and β-chain residues located in the α1β1 interface. A more detailed study is necessary to rigorously examine this hypothesis.

Much is now known about what makes the globin fold a robust structure (Perutz et al. 1960; Bashford et al. 1987; Murzin and Finkelstein 1988; Brenner et al. 1997). Proteins whose secondary structures are mainly alpha helices, such as β-globin chains, are flexible and can easily accommodate many residues or prosthetic groups without disrupting tertiary or quaternary structural arrangements (Chothia et al. 1977; Efimov 1979). β-Globins share the canonical features of the globin fold and have maintained a robust structure despite 200 million years of evolutionary divergence (Efstratiadis et al. 1980; Czelusniak et al. 1982). Arguably, the most important feature that explains the preservation of the globin fold is the clear conservation of hydrophobic residues at buried positions in globin proteins (Lesk and Chothia 1980). In our study we identified regions conserved in all three β-globin genes, located in the interior or hydrophobic part of the subunit. Presumably, these conserved sites are involved in the maintenance of the secondary structures which in turn stabilize the tertiary and quaternary structures of hemoglobin. Furthermore, we found that some of the conserved sites are also part of empty concavities of the protein surface accessible to solvent (Liang et al. 1998). Concavities are particularly important, as they are often associated with binding and catalytic activity (Liang and Dill 2001). For example, of the 23 sites which participate in interactions with the heme group, 15 correspond to the conserved sites in our study, with 3 involved in hydrogen bonding. With the exception of site 38Thr, all sites that participate in interactions with the heme ligand have hydrophobic–hydrophobic contacts, which stabilize the structure. Hence, during the long evolutionary history of the genes encoding the β-globin chain of hemoglobin, these functionally and structurally important sites have been preserved, while at the same time a fraction of residues has been the target of divergent fine-tuning of the protein function.

Gene family evolution reflects a balance between homogenization by unequal crossing-over and gene conversion, and diversification by mutation (Ohta 2000). Both drift and selection play an important role in the evolutionary fate of duplicated genes, but only positive selection can account for the evolution of new functions (Ohta 1987). The dynamics of these forces are complicated (Ohta 2000), and our analysis of the β-globin family of genes illustrates this complexity. Gene conversion is clearly a frequent force for homogenization of some closely related members of this family (e.g., Aγ- and Gγ-globins). As expected, gene conversion is less important to the evolution of the more divergent members, as it is prohibited when sequence divergence is too high (Ohta 2000). In addition to the partitioning of β-globin paralogs into domains of expression, this gene family exhibits divergence both by positive Darwinian selection (β- and γ-globins) and by differential patterns of purifying selection pressure (γ- and ε-globins). While more tests are clearly necessary to fully discriminate between the DDC and Ohta models, we suggest that comparison between synonymous and nonsynonymous substitution rates provides a useful tool for studying the relative roles of different evolutionary forces during the evolution of a gene family.