Introduction

The timing and order of historical speciation events within vertebrate animals may be revealed by constructing an evolutionary tree. Constructing such trees depends critically on obtaining suitable characters for phylogenetic analysis, such as nucleotide sites or morphological features. Many of the shallow divergences within vertebrate orders are routinely resolved with nucleotide sequence data. However, the basal relationships among vertebrates remain contentious (Murphy et al. 2001a, b; Arnason et al. 2002; Arnason and Janke 2002; Delsuc et al. 2002; Scally et al. 2002; Meyer and Zardoya 2003; Springer et al. 2003; Takezaki et al. 2003, 2004). A general strategy to overcome this problem has been to use larger datasets in the form of more taxa and/or more characters (Nei et al. 1998; Rosenberg and Kumar 2001b; Hillis et al. 2003; Rokas et al. 2003a, b; Rosenberg and Kumar 2003; Rokas and Carroll 2005; Teeling et al. 2005). However, even with larger datasets, contradictory results still arise (Meyer and Zardoya 2003), such as those observed for primates, rodents, and lagomorphs (Murphy et al. 2001a; Arnason et al. 2002).

The debate over appropriate datasets for such analyses is fundamentally a debate over the utility of assorted characters. This debate has motivated attempts to empirically characterize the phylogenetic power of particular classes of molecular genetic data (Graybeal 1994; Russo et al. 1996; Naylor and Brown 1997; Collins et al. 2005). For instance, a rough estimate of phylogenetic utility has commonly been obtained by assessing the percentage sequence divergence for multiple genes between two relevant taxa for which there are extensive data available, the assumption being that moderate variability (say, 80–95%) indicates appropriate informativeness (Murphy et al. 2001a). Graybeal (1994) pointed out that such a single measure does not capture the relevant molecular evolutionary patterns that comprise phylogenetic informativeness. Primarily, heterogeneity of informativeness emerges from the fact that sites evolve at different rates both within and among genes. It has long been noted that sites evolve at different rates among different genes (Li 1997). Recently, it has been argued that sequence data can be filtered to contain those sites with the greatest signal of past speciation events (Glazko and Nei 2003; Springer et al. 2003; Hedges et al. 2004). For instance, analyses may be restricted to only genes or sites that satisfy a test for clock-like evolution. The application of a molecular clock has been argued to alleviate some of the concerns over varying rates among lineages (Glazko and Nei 2003; Hedges and Kumar 2003). Complementarily, incorporating knowledge of the heterogeneity of evolutionary rates among sites into models can lead to dramatic improvement in phylogenetic method (Yang 1996; Felsenstein 2001). For the case of a protein-coding region, there is heterogeneity in the rate of change of both nucleotides and amino acids among sites. These rates for each site may be estimated to determine whether a gene is appropriate or not appropriate for a particular phylogenetic study. The phylogenetic informativeness (Townsend 2007) uses the site rate estimates to profile phylogenetic informativeness for specified historical epochs. Here, we calculate the suitability of protein-coding regions for reconstruction of deep divergences in the vertebrate tree.

To demonstrate this approach for resolution of the phylogenetic relationships among vertebrates, here we examine the pattern of molecular evolution and consequent phylogenetic utility of a set of paradigmatic genes originally characterized by Graybeal (1994). Graybeal analyzed the phylogenetic utility of these genes by plotting proportional sequence divergence against divergence time of taxa as millions of years ago (MYA). Such empirical saturation plots convey phylogenetic information content, justified by the claim that those genes whose divergence increases over the time scale of interest are the genes most likely to be informative about phylogenetic relationships emerging within that time scale. Although this claim is intuitively sensible and likely to be true in a fairly robust way, it is difficult to characterize the frequently nonlinear relationship between time and sequence divergence with much accuracy. Once characterized, it remains difficult to draw from such an empirical saturation plot precise quantitative predictions about the utility of genes for particular epochs. Here we use phylogenetic informativeness (Townsend 2007) to analyze the four genes that Graybeal (1994) identified as exemplars of particular kinds of molecular evolution (albumin, cytochrome b, c-myc, and calmodulin) as well as seven genes that Graybeal predicted would be of high utility for further resolution of the deep vertebrate phylogeny. Our results are consistent with those predictions, but provide improved precision. Both the study by Graybeal (1994) and this study examine the genes for their phylogenetic information at both the nucleotide and the amino acid level. We address which genes contain more phylogenetic information for the epochs of interest, as well as the controversy over whether the amino acid sites or the underlying nucleotide sites are more informative (Russo et al. 1996; Simmons 2000; Simmons et al. 2002a, b, 2004a, b; Gissi et al. 2006). Amino acids are often advocated because they are generally less saturated than nucleotide sites for ancient divergences. However, nucleotide sites have a threefold advantage in number. Here we compare the amount of signal present in nucleotides and amino acids, without directly addressing issues of noise. We show that there is a considerably greater signal in nucleotides due to their threefold greater representation, arguing for their use in analyses of epochs and taxa that are not subject to strong biases due to convergence. We also note that the low level of noise concomitant with analyses based on amino acid sequence leads to fairly high support values even when the net amount of signal is low.

Methods

Sequence Data

We obtained nucleotide sequences for the following vertebrate genes: albumin, cytochrome b, calmodulin, and c-myc. For each gene, representative mammalian sequences identified by Graybeal (1994) were obtained from the GenBank database. A BlastP (Altschul et al. 1997) homology search revealed other available mammalian sequences. When possible, we also used sequences from a bird (Gallus), an amphibian (Xenopus), and a bony fish. Note that lengths of genes reported here represent the lengths of alignable sequence across all taxa that could be processed to produce site-rate estimates and, in some cases, amount to ≤90% of the length of gene in any individual taxon.

Phylogenetic Trees

In order to compute the rates of evolution among nucleotide and amino acid sites, we first needed to specify an evolutionary tree of vertebrates. This tree was obtained from Springer et al. (2003) and is based on both molecular and fossil data. The branch lengths, which correspond to divergence times, were taken from Springer et al. (2003), complemented by Poux and Douzery (2004).

Calculating the Molecular Evolutionary Rate Among Sites

Using the sequence data and the phylogenetic tree, molecular evolutionary rates were estimated for each site. To align the nucleotide sequences for each of the four genes in this study, ClustalX (Thompson et al. 1997) was first used to align the amino acid sequences. The amino acid alignment provided a guide with which we aligned the nucleotide sequences via MUSCLE (Edgar 2004). We used the baseml and codeml components of PAML (Yang 1997) to obtain the evolutionary rates at both nucleotide sites and amino acid sites. For estimation of each site rate, we maximized the likelihood of the ultrametric tree in units of time for that site, enforcing a global clock. The Jukes-Cantor model, for nucleotides, and the Poisson model, for amino acids, were assumed. Because the data are so finely partitioned when estimating independent rates for each site, it was appropriate to use molecular evolutionary models with minimal parameterization.

To ensure that low-complexity models did not lead to spurious results, we also repeated the analysis using the Akaike information criterion (AIC) in Modeltest 3.7 (Posada and Crandall 1998) and ProtTest 1.4 (Abascal et al. 2005) to select the nucleotide and amino acid substitution models that best fit the data. To infer site rates using these more complex models, we used Hyphy 0.99 (Pond et al. 2005) and Rate4site (Mayrose et al. 2004). Since rates were estimated independently for each site, we parameterized the estimation using the best-fit model without the gamma shape (G) and invariant sites (I) parameters that would, contrary to our main intent, tightly constrain the ultimate site-rate distribution. We fixed all other model parameters to the genewide estimates so that the rate was the only parameter to be optimized for each site. Phylogenetic informativeness profiles and phylogenetic informativeness per site values for various epochs differed among genes but were virtually identical across models for each gene, except in the cases of cytochrome b and gelsolin. For these two genes, the use of TVM for cytochrome b and GTR for gelsolin resulted in the estimation of very fast rates for several saturated sites. These fast rates resulted in profiles with a skew toward greater informativeness in more recent times. In the case of cytochrome b, our conclusions about differential relative informativeness over epochs discussed below should consequently be considered conservative. With the less complex model, the profile for cytochrome b is already highly skewed in this direction. In the case of gelsolin, it is possible that this gene may have greater utility for recent times than estimated using the less complex model. However, one should be skeptical of fast rates estimated for saturated sites under complex models.

Calculating the Phylogenetic Informativeness

For each gene, the phylogenetic informativeness profile ρ as a function of time, T, was calculated as

$$ \rho (T;\lambda_{1} , \ldots ,\lambda_{n} ) = \sum\limits_{i = 1}^{n} {16\lambda_{i}^{2} } Te^{{ - 4\lambda_{i} T}} , $$

substituting the estimated rates λ i of evolution of each site i (Townsend 2007). This formula provides a metric of the probability that character i would provide an unambiguous synapomorphy lying within an asymptotically short internode between two pairs of sister taxa whose common ancestor is at time T. Integrations of the phylogenetic informativeness profile from the origin (h 1 ) to the terminus (h 2 ) of the epochs of interest, \( \int_{{h_{1} }}^{{h_{2} }} {\rho (T;\lambda ){\text{d}}T} \), were performed in Mathematica (Wolfram Research, Inc.). This formulation of phylogenetic informativeness is a metric of the probability of a substitution occurring along the short internodes of phylogenetic quartets that is not later obscured by subsequent substitution, and captures the primary component of informativeness (Townsend 2007) for shallow and deep tree reconstruction underlying the common methods of phylogenetic reconstruction (cf. Rosenberg and Kumar 2001a).

For each gene, we summed the phylogenetic informativeness of all sites in that gene to reveal a net phylogenetic informativeness that conveys the degree to which that gene (compared to others) is predicted to contribute to resolution of the phylogeny during history. We also divided that informativeness by the number of sites, resulting in the phylogenetic informativeness per site, which reveals the relative power of the genes normalized by their length. This measure reveals the power per site sequenced. Phylogenetic informativeness per site is of interest both because the cost vs. benefit of sequencing and analysis may be quantified with such a measure, and because it is interesting to compare relative power of genes without the primary influence of gene length. However, for most purposes, the net phylogenetic informativeness is the prediction of interest, as it should correlate with empirical results, such as the degree of support of a node. Thus, unless specified otherwise, our use of the term phylogenetic informativeness refers to the net phylogenetic informativeness.

Results

The phylogenetic informativeness per nucleotide site of cytochrome b peaked at very recent times (10–50 MYA), then dropped precipitously below that of albumin at around 60 MYA and below that of c-myc at around 170 MYA (Fig. 1). For divergences more recent than 60 MYA, albumin was nearly as informative as cytochrome b; between 60 and 200 MYA, the informativeness of albumin was greater than that of cytochrome b, c-myc, and calmodulin. The superior informativeness of albumin nucleotide sequence diminished as it approached 500 MYA (Fig. 2a); in epochs more than 500 MYA, its informativeness dropped below that of other genes. Though cytochrome b had the highest phylogenetic informativeness per nucleotide site for times more recent than 60 MYA, it showed remarkably less promise for times prior to ~100 MYA. The other three genes did not show as extreme a peak of informativeness and, instead, maintained their relative performance over a 500 million-year time scale.

Fig. 1
figure 1

The phylogenetic informativeness per site of the nucleotide sequences of four genes, from 200 million years ago to the present. Integration of the phylogenetic informativeness profile reveals the greater phylogenetic informativeness per site of albumin (red) compared to cytochrome b (blue) between 65 million and 107 million years in the past. For reference, phylogenetic informativeness per site for c-myc (green) and calmodulin (yellow) is also graphed

Fig. 2
figure 2

Phylogenetic informativeness profiles, assumed ultrametric tree, and trees estimated from the nucleotide and amino acid sequences of albumin. (a) Phylogenetic informativeness of the nucleotide (blue) and amino acid (red) sequences. (b) Dated ultrametric tree extracted from Springer et al. (2003) and Poux and Douzery (2004), assumed to calculate site rates and phylogenetic informativeness. (c) Maximum likelihood tree estimated from the nucleotide sequences with Tree-Puzzle (Schmidt et al. 2002), using the HKY85 model (Hasegawa et al. 1985) and eight categories of site rates constrained by a gamma distribution with an α parameter estimated to be 0.97. A molecular clock for the nucleotide sequences was rejected by the likelihood ratio test in Tree-Puzzle (P < 0.05). Nodes are labeled with quartet puzzling support. (d) Maximum likelihood tree estimated from the amino acid sequences with Tree-Puzzle, using the JTT model (Jones et al. 1992) and eight categories of site rates constrained by a gamma distribution with an α parameter estimated to be 1.78. A molecular clock for the amino acid sequences was rejected by the likelihood ratio test in Tree-Puzzle (P < 0.05). Nodes are labeled with quartet puzzling support

Comparison of the profiles of informativeness for these nucleotide sequences (Figs. 2a, 3a, 4a, 5a) against organismal clock-enforced trees (Figs. 2b, 3b, 4b, 5b) illustrated the greater potential for signal in cytochrome b nucleotide sequence for questions of recent vertebrate evolution (e.g., recent mammal divergences) and the greater potential for signal in albumin nucleotide sequence for all of the rest of verteb rate history. The phylogenetic informativeness profiles of the amino acid sequences of these genes (Figs. 2a, 3a, 4a, and 5a) showed a pattern of informativeness similar to the pattern of informativeness of the nucleotide sequences, with two significant differences. First, albumin yielded the highest informativeness over all times: the amino acid sequence of cytochrome b did not feature the outstanding peak of informativeness for recent times that characterized its nucleotide sequence. Second, calmodulin, which was at least weakly informative via its nucleotide sequence, demonstrated virtually no phylogenetic informativeness via its amino acid sequence due to low divergence.

Fig. 3
figure 3

Phylogenetic informativeness profiles, assumed ultrametric tree, and trees estimated from the nucleotide and amino acid sequences of cytochrome b. (a) Phylogenetic informativeness of the nucleotide (blue) and amino acid (red) sequences. (b) Dated ultrametric tree assumed to calculate site rates and phylogenetic informativeness. (c) Maximum likelihood tree estimated from the nucleotide sequences, using the HKY85 model and eight categories of site rates constrained by a gamma distribution with an α parameter estimated to be 0.29. A molecular clock for the nucleotide sequences was rejected (P < 0.05). Nodes are labeled with quartet puzzling support. (d) Maximum likelihood tree estimated from the amino acid sequences, using the JTT model (Jones et al. 1992) and eight categories of site rates constrained by a gamma distribution with an α parameter estimated to be 0.39. A molecular clock for the amino acid sequences was rejected (P < 0.05). Nodes are labeled with quartet puzzling support

Fig. 4
figure 4

Phylogenetic informativeness profiles, assumed ultrametric tree, and trees estimated from the nucleotide and amino acid sequences of c-myc. (a) Phylogenetic informativeness of the nucleotide (blue) and amino acid (red) sequences. (b) Dated ultrametric tree, assumed to calculate site rates and phylogenetic informativeness. (c) Maximum likelihood tree estimated from the nucleotide sequences, using the HKY85 model and eight categories of site rates constrained by a gamma distribution with an α parameter estimated to be 0.54. A molecular clock for the nucleotide sequences was not rejected. Nodes are labeled with quartet puzzling support. (d) Maximum likelihood tree estimated from the amino acid sequences, using the JTT model (Jones et al. 1992) and eight categories of site rates constrained by a gamma distribution with an α parameter estimated to be 0.64. A molecular clock for the amino acid sequences was not rejected. Nodes are labeled with quartet puzzling support

Fig. 5
figure 5

Phylogenetic informativeness profiles, assumed ultrametric tree, and trees estimated from the nucleotide and amino acid sequences of calmodulin. (a) Phylogenetic informativeness of the nucleotide (blue) and amino acid (red) sequences. (b) Dated ultrametric tree, assumed to calculate site rates and phylogenetic informativeness. (c) Maximum likelihood tree estimated from the nucleotide sequences, using the HKY85 model and eight categories of site rates constrained by a gamma distribution with an α parameter estimated to be 0.19. Nodes are labeled with quartet puzzling support. (d) Maximum likelihood tree estimated from the amino acid sequences, using the JTT model (Jones et al. 1992) and eight categories of site rates constrained by a gamma distribution with an α parameter estimated to be 97.03. A molecular clock for the amino acid sequences was not rejected. Nodes are labeled with quartet puzzling support

Branch lengths and bootstrap support values for trees estimated from nucleotide and amino acid sequence data were in accord with the relative utility of genes as predicted by the phylogenetic informativeness. Albumin, which was predicted to have the greatest phylogenetic informativeness over the greatest range of time, provided an accurate and well-supported estimate of the phylogeny examined. Although trees estimated from both the nucleotide and amino acid sequences of albumin (Fig. 2c, d) incorrectly placed (human, macaque) as sister clade to (dog, (boar, (cow, sheep))) rather than as sister clade to (rat, mouse), the low support for this incorrect placement was complemented by high support values and branch lengths for all other, correct, components of the tree.

As predicted by the profile of phylogenetic informativeness for the nucleotide sequence of cytochrome b (Fig. 3a), the signal in recent times was manifested in the resulting nucleotide trees by long branch lengths for recent mammalian lineages (Fig. 3c). Consistent with the dramatic drop in informativeness and the potential role of noise, deeper mammalian divergences were poorly resolved. Amino acid sequence, predicted to be marginally informative (Fig. 3a), fared little better (Fig. 3d).

In contrast with the unequable informativeness of cytochrome b across epochs, the informativeness of cmyc, while not high, was relatively evenly distributed across epochs for both nucleotide sequence and amino acid sequence (Fig. 4a), and resulted in short branch lengths with moderate support (Fig. 4c, d). While the tree topology was consistent with and therefore equally accurate as that inferred from albumin sequence, support for individual nodes with both nucleotide and amino acid sequence was generally not quite as high as for albumin, and branch lengths were shorter.

The phylogenetic informativeness profiles for calmodulin predicted a minimal level of informativeness for nucleotide sequence back to 500 MYA and virtually no resolution from calmodulin amino acid sequence (Fig. 5a). The low support values and near-zero branch lengths for the tree estimated from nucleotide sequences (Fig. 5c) were consistent with the former prediction. The lack of topological content or support in the tree estimated from amino acid sequences (Fig. 5d) was consistent with the latter prediction.

In general, profiles projected the rates of evolution of the sites into informativeness levels over time. Several profiles of informativeness per nucleotide site featured extreme spikes of informativeness during very recent times (e.g., cytochrome b and albumin in Fig. 1 and Fig. 3a, respectively, near time 0 on the x-axis). In each case, these spikes in informativeness corresponded directly to one or a few site(s) whose maximum likelihood estimates of evolutionary rate were extremely high. Fast-evolving sites may be very useful for recent phylogenetic divergences. However, likelihood estimates for sites that evolve rapidly are typically imprecise, so such localized spikes should be considered with some skepticism. Removal of the fastest sites from the profile calculation abrogated the spike near the y-axis and left the rest of the profile unchanged.

The relative informativeness of the nucleotide sequences of albumin, cytochrome b, c-myc, and calmodulin for phylogenetic inference within the past 500 million years is displayed in Fig. 6a. In this figure, the gene informativenesses was normalized over all loci and all times so that the total normalized informativeness at all times is one. This normalization facilitates visual comparison of the informativeness of genes across extensive time scales. Although cytochrome b nucleotide sequence was relatively more informative for fairly recent inference, albumin was most informative for all times between 100 and 500 MYA, and for a significant portion of the time before 100 MYA. Albumin demonstrated a comparative monopoly of informativeness with regard to amino acid sequence over the past 500 million years (Fig. 6b), with cytochrome b being superceded by c-myc for inference of ancestral branching order earlier than ~150 MYA.

Fig. 6
figure 6

Relative phylogenetic informativeness of the nucleotide sites of albumin (red), cytochrome b (blue), c-myc (green), and calmodulin (yellow), and their translated amino acid sites for the past 500 million years. (a) Relative phylogenetic informativeness of the nucleotide sites. (b) Relative phylogenetic informativeness of the amino acid sites

Integration over a region of the phylogenetic informativeness profile of a gene supplies a quantitative assessment of the gene’s utility for a particular epoch (Townsend 2007). For instance, integration of the profile between 65 and 107 MYA (Fig. 1) provided the percentage of phylogenetic informativeness provided by that gene that lies within the epoch of the rapid radiation of placental mammals, allowing direct comparison of genes for their utility in resolving phylogenetic conundrums of the placental mammalian radiation.

Integrations over the periods of the mammalian, vertebrate, and metazoan radiations revealed the relative utility of genes for phylogenetic inference (Table 1). Genes selected by Graybeal (1994) for their predicted utility in resolving deep divergences among vertebrates are listed in Table 1 in order of their informativeness per nucleotide site for the vertebrate radiation (375–405 MYA). Per-nucleotide phylogenetic informativeness measures are reported as percentages, conveying the percentage of the total informativeness of that gene that lies within the epoch noted. Although some genes were consistently better than others across much of this time scale (e.g., albumin), others were quite heterogeneous in their utility (e.g., cytochrome b as an extreme, cmos as a more typical example). Phylogenetic informativenesses of nucleotide and amino acid sequence data were clearly correlated, but not rigidly. In particular, infomativenesses for certain genes (cytochrome b, calmodulin) and for certain epochs (7–26 MYA) appeared to have weaker correlations.

Table 1 Phylogenetic informativeness per site

The phylogenetic informativeness per site was fairly similar for amino acid sequence and for nucleotide sequence over the epochs examined (Table 1). In 19 cases the phylogenetic informativeness of DNA sequences was higher, and in 18 cases the phylogenetic informativeness of amino acid sequences was higher. In the other cases they were nearly equal per site. However, because a gene has threefold more nucleotide than amino acid sites, amino acid phylogenetic informativeness per site would have to be three times as great as nucleotide informativeness per site to be equivalent. The instances where amino acid informativeness is greater (e.g., albumin for the mammalian radiation, somatotropin for the vertebrate radiation) showed differences that are smaller in scale than the threefold reduction in number of characters that ensues from choosing to analyze amino acid rather than nucleotide sequence. Phylogenetic informativeness of the nucleotide and amino acid sequence data for the four paradigmatic genes is plotted in Figs. 2a, 3a, 4a, and 5a. In all four genes, the phylogenetic informativeness is higher for the nucleotide data than for the corresponding amino acid sequence data. The ratios of nucleotide-to-amino acid sequence informativeness are most dramatic for cytochrome b and calmodulin. Branch lengths in inferred tree topologies were markedly shorter for all trees estimated from amino acid sequences compared to corresponding trees estimated from nucleotide sequences, except perhaps for the case of albumin. Only for the calmodulin dataset, however, were support values for nodes of the inferred tree dramatically lower for trees estimated from amino acid sequence compared to nucleotide sequence.

Discussion

The genes selected by Graybeal (1994) for their utility for resolving deep divergences among vertebrates are fairly consistent in their strength for that purpose. Indeed, the least powerful of those selected, gelsolin, was the gene for which the least diverse sequence data were available at the time (Graybeal 1994). Of the nuclear genes, albumin performed best, regardless of whether DNA or amino acid sequences were used. This high informativeness stands in contrast to calmodulin, with its slow rate of amino acid evolution—an uninformative gene for reconstructing vertebrate divergences. Note that albumin, the gene that showed the most promising informativeness, has a long history of phylogenetic utility, dating back to the Sarich and Wilson (1967) study of ape evolution. The strong predicted and actual performance of albumin is tempered by an exception of a poorly supported component of the topology (Fig. 2c and d) that places (human, macaque) as sister to (dog, boar, (cow, sheep)) rather than sister to (rat, mouse). This exception demonstrates the need for analysis of data from multiple informative genes even when a single gene is optimally informative. Even when numerous sites change at an optimal rate, substitution events are stochastic, and changes of state may not map to the relevant branches.

Furthermore, Takezaki and Gojobori (1999) showed the importance of accommodating rate variation among sites for reconstructing the vertebrate tree from mitochondrial DNA sequences. Our results support the conclusion of Takezaki and Gojobori (1999) that cytochrome b is not an ideal gene for reconstructing speciation events for midlevel to deep vertebrate phylogeny. Similarly, Graybeal’s (1994) empirical saturation plots indicated that cytochrome b comparisons accumulated little divergence in silent or replacement sites beyond 80 MYA. Although some of the dominance of albumin in phylogenetic informativeness derives from albumin’s length (1764 nucleotides), most of its informativeness derives from its pattern of molecular evolution. As an osmotic pressure regulator and generalized serum lipid transport protein, many sites of the gene are free to vary. This difference in pattern may be even more clearly seen in the relative phylogenetic informativeness plot (Fig. 6). This format for presentation of the phylogenetic informativeness displays more clearly the relative informativeness of genes during more ancient epochs, when all profiles show diminished absolute informativeness. Graybeal’s (1994) empirical saturation plot for albumin demonstrated that the encoded amino acid sequence is changing fairly rapidly. The result of this rapid evolution of amino acid sites is a remarkable degree of utility to both the nucleotide sequence of the gene and the amino acid sequence of the protein for phylogenetic resolution in recent times, in comparison with cytochrome b, c-myc, or calmodulin. These other proteins have much more constrained patterns of amino acid sequence evolution.

Using the phylogenetic informativeness to evaluate the utility of genes for construction of the vertebrate tree predicts that nucleotide sequence data will generally outperform amino acid sequence data. Such a conclusion should be approached cautiously, however, because amino acid sequence is generally predicted to provide fewer homoplasious sites than the underlying nucleotides. The major reasons for this expectation are the slower substitution rate and greater number of character states for amino acids. Although the Townsend (2007) phylogenetic informativeness explicitly addresses the relation of signal to the rate of evolution, it does not account for the probability of misleading noise due to convergence. Thus, the degree to which the greater state space of amino acid sequence deters the misleading effects of noise remains to be fully understood (Simmons et al. 2004a). Naturally, the greater potential state space of amino acid sites conveys some degree of increased utility to amino acid data by diminishing the misleading effect of noise. This expectation is consistent with our results in the following way: in our study, greater net signal resulted in significantly longer branch lengths for nucleotide trees, but only moderately higher quartet puzzling support. Presumably, noise from the smaller state space of nucleotide sites resulted in weakened support for nodes in nucleotide trees. Nevertheless, the prediction of higher informativeness for nucleotide data at these divergences is consistent with the simulation results of Simmons et al. (2004a), as well as the empirical results of Murphy et al. (2001a), who found that the combined amino acid sequences from 11 coding nuclear genes lacked sufficient power to resolve basal vertebrate relationships that were revealed by the combined nucleotide dataset.

Reconstruction of the vertebrate phylogenetic tree has been attempted using many strategies—increasing numbers of sites, enforcing or relaxing a molecular clock, increasing the number of taxa sampled, and restricting analysis to homogeneously evolving sites, to transversions, or to first, second, or third positions. In this study, we argue that gathering more sites of data or more taxa alone is not the optimal strategy. Parsing out the sources of signal conflict in increasingly large datasets will improve phylogenetic inference of the systematic relationships among vertebrate lineages (Meyer and Zardoya 2003). Moreover, it may be insufficiently revealing to sample genes haphazardly, and then remove genes for which the molecular clock has been rejected or remove sequences for which inhomogeneity of the evolutionary process is detected (Kumar and Gadagkar 2001; Collins et al. 2005; Fedrigo et al. 2005). By estimating the full distribution of rates at each site, one may select genes that will perform best for the epoch of interest, whether for recent divergence times or for more ancient divergence times. Profiles of phylogenetic informativeness may be used to select optimal genes to sequence for the resolution of peculiarly recalcitrant polytomies in otherwise well-resolved trees (e.g., Lara et al. 1996; Lessa and Cook 1998), as well as to substantiate any claim that a polytomy is truly “hard” or “soft.” Only when a dataset is highly informative for the epoch in question should a hard polytomy be invoked. For the time scale examined here (< 500 million years), our results generally encourage the use of DNA sequences instead of amino acid sequences, especially in information-sparse situations where biases are not resulting in extensive phylogenetic conflict. Conventional wisdom dictates the use of intuitive expert judgment in choosing genes appropriate for phylogenetic questions, perhaps including consideration of the evolutionary rate of the gene. However, different sets of characters evolving at about the same average rate may show utterly different phylogenetic informativeness profiles (Townsend 2007). One of the main utilities of the method presented here is to give an explicit quantitative procedure for judicious experimental design that goes beyond intuitive common sense. We have focused here on a familiar and well-studied set of genes so as to establish the power of the method to provide information consistent with experience gathered from numerous studies. By performing such comparisons in a larger-scale, automated manner, this method will help to identify and quantitatively rank genes or fragments of genes that evolve at adequate rates for specific phylogenetic problems. By furnishing quantitative estimates of the informativeness for specific time periods, profiling phylogenetic power enables optimal experimental design recommendations, and the ranking of commonly used genes provided here should enhance experimental design by furnishing a greater understanding of the heterogeneity of change across sites and its effect on the accurate reconstruction of evolutionary trees.