Introduction

The degree to which an amino acid site is free to vary is strongly dependent on its structural and functional importance. An amino acid that plays an essential role, such as one within the active site of the protein, is unlikely to change over evolutionary time. Hence, the evolutionary rate at an amino acid site is indicative of how conserved this site is and, in turn, allows evaluating the importance of this site in maintaining the structure/function of the protein (Lichtarge and Sowa 2002). In silico detection of conserved regions can thus focus mutagenesis experiments on functionally important residues (e.g., Donaudy et al. 2003).

Conservation levels are typically inferred from a multiple sequence alignment of homologous proteins. Numerous methods for detecting site-specific conservation have been previously proposed. Nineteen such scores, developed in the last 30 years, have recently been reviewed by Valdar (2002). Though evolution is the driving force which determines site conservation, hardly any of these methods takes the evolutionary relationships among the sequences (phylogeny) into account (but see Lichtarge et al. 1996; Armon et al. 2001).

Conservation levels and evolutionary rates are, in fact, synonymous. A conserved site is slow evolving while a variable site evolves rapidly. It is this observation that places the problem of conservation-score estimation in the realm of probabilistic evolutionary models. Although probabilistic models are extensively used in phylogenetic and in molecular evolutionary studies, they have only recently been applied for evaluating site-specific evolutionary rates in proteins (Dean and Golding 2000; Pupko et al. 2002). Such methods have been shown to be superior to those reviewed by Valdar (2002) since they take into account the branch lengths of the phylogenetic tree and make use of explicit probabilistic-based evolutionary models (Pupko et al. 2002). We note that evolutionary rates are commonly measured as the number of replacements per amino acid site per year. Here we define a site-specific rate as the evolutionary rate of the site relative to the average evolutionary rate across all sites. Site-specific rates are hence unitless.

All algorithms of rate inference presented to date are composed of two basic steps: (1) construct the phylogenetic tree and (2) infer site-specific rates. Various inference methods differ in the manner in which step 2 is performed, while all rely on the assumption that the phylogeny is absolutely correct. Such an approach may lead to erroneous results if the inferred phylogeny does not reflect the true underlying evolutionary relationships among the sequences. As demonstrated in Fig. 1, inferred rates might be substantially different under two different topologies. An alternative approach would consider the phylogeny as an additional parameter of the model and would compute site-specific rates taking into account alternative tree topologies. More accurate predictions of evolutionary rates are thus expected since the inherent uncertainty in the phylogeny is integrated in the computation. Such a comprehensive evolutionary approach calls for the use of Bayesian phylogenetics.

Figure 1

The rate of a position depends on the assumed phylogenetic tree. The inferred evolutionary rate is different for the two topologies. a The inferred rate would be relatively low since only one replacement is sufficient to explain the data. b The inferred rate would be higher since at least two replacements are necessary. Branches that differ between the two trees are dashed. Capital letters in parentheses are one-letter abbreviations for amino acids.

Bayesian estimation of phylogeny is based upon the posterior probability distribution of trees. The posterior probability distribution for an unrooted phylogenetic tree involves a huge number of tree topologies [$(2n-5)!/\bigl(2^{n-3}(n-3)!\bigr)$ unrooted trees for n sequences (Felsenstein 2004)]. For each such tree, there are infinite possibilities of branch length combinations. This parameter space is too complex to solve analytically. Markov chain Monte Carlo (MCMC) (Metropolis et al. 1953; Hastings 1970) is a numerical method that can be used for Bayesian inference from this complex parameter space. MCMC is firmly grounded in probability theory (see, e.g., Gelman et al. 1995) and has recently been applied in phylogenetic studies (Yang and Rannala 1997; Larget and Simon 1999; Mau et al. 1999; Li et al. 2000; Huelsenbeck and Ronquist 2001; McGuire et al. 2001; Jow et al. 2002). In this paper we describe a novel Bayesian method for site-specific rate estimation. Using MCMC, we integrate over the space of all possible tree topologies, branch lengths, and evolutionary model parameters to obtain site-specific rate estimates that account for the stochastic nature of the underlying evolutionary process. We show that by doing so a significant improvement in site-specific rate estimation is achieved.
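
To illustrate how rapidly the number of candidate topologies grows, the short Python sketch below evaluates this formula for a few sequence counts; it is a stand-alone illustration and not part of any of the programs discussed here.

```python
from math import factorial

def num_unrooted_topologies(n: int) -> int:
    """Number of distinct unrooted binary tree topologies for n >= 3 sequences:
    (2n - 5)! / (2^(n-3) * (n - 3)!)  (Felsenstein 2004)."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

if __name__ == "__main__":
    for n in (5, 10, 17, 27):
        print(f"{n:>3} sequences: {num_unrooted_topologies(n):,} unrooted topologies")
```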

Methods

The Evolutionary Model

In this study, the JTT model of amino acid replacements (Jones et al. 1992) with among-site rate variation, as specified by a gamma distribution, is used. The only additional parameter required by this model is the shape parameter of the gamma distribution, α. Thus, the free parameters in our model are τ, t, and α, which are the tree topology, the branch lengths, and the among-site rate variation parameter, respectively.

Site-Specific Rate Estimation Given a Fixed Phylogenetic Tree

A phylogenetic tree T= (τ, t) is defined by its tree topology τ and associated branch lengths t. The branch lengths of the phylogenetic tree represent the average evolutionary rate across all sites. A relative site-specific rate r indicates how fast this site evolves relative to the average rate across all sites. For example, a rate of 2.0 indicates a site that evolves two times faster than the average. Thus, site-specific rates inferred here are not absolute evolutionary rates that require knowledge of divergence times but, rather, represent a comparative quantity.

Within the Bayesian framework, the posterior probability of any given rate r is obtained from the likelihood function and the prior probability. The most commonly chosen prior distribution for modeling rate variation across sites is the gamma distribution. The gamma distribution has a single parameter, α, which determines its shape (Swofford et al. 1996; Yang 1996). Given a tree T and a parameter α, the posterior probability for any given rate r is obtained using Bayes’ (1763) theorem

$$ P(r|data,T,\alpha )={{P(data|r,T)P(r|\alpha )} \over {\int\limits_{r'=0}^\infty {P(data|r',T)P(r'|\alpha )dr'} }} $$
(1)

where the likelihood P(data|r,T) can be calculated using Felsenstein’s (1981) postorder tree traversal algorithm, and P(r|α) is the prior distribution on the rates. A detailed description of the calculation is given by Mayrose et al. (2004). We denote by P both probabilities and densities, depending on the context. The gamma distribution is approximated by a discrete rate distribution with k = 16 categories (Yang 1994). For the task of site-specific rate inference this number of categories sufficiently approximates the continuous gamma distribution (Mayrose et al. 2004). After the discrete approximation Eq. (1) becomes

$$ P(r=r_i |data,T,\alpha ) \cong {{P(data|r_i ,T)} \over {\sum\limits_{j=1}^k {P(data|r_j ,T)} }} $$
(2)

The prior probabilities in Eq. (1) cancel out since all rate categories have equal prior probability (1/k).
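
For concreteness, the discrete gamma approximation referred to above (k = 16 equal-probability categories, each represented by its mean rate; Yang 1994) can be sketched as follows. This is an illustrative reimplementation using SciPy, not the McRate code itself.

```python
import numpy as np
from scipy.stats import gamma

def discrete_gamma_rates(alpha: float, k: int = 16) -> np.ndarray:
    """Return k rate categories approximating a gamma(alpha) distribution with
    mean 1 (shape = rate = alpha). Each category has prior 1/k and is
    represented by the mean rate within its equal-probability bin (Yang 1994)."""
    # Bin boundaries: quantiles of the gamma(shape=alpha, rate=alpha) distribution.
    bounds = gamma.ppf(np.linspace(0.0, 1.0, k + 1), a=alpha, scale=1.0 / alpha)
    # Mean of X within [a, b] for X ~ Gamma(shape, rate), using
    # E[X; a < X < b] = (shape/rate) * [F_{shape+1}(b) - F_{shape+1}(a)] = 1 * [...].
    cdf_shifted = gamma.cdf(bounds, a=alpha + 1.0, scale=1.0 / alpha)
    rates = k * np.diff(cdf_shifted)   # divide out the 1/k bin mass
    return rates / rates.mean()        # guard against numerical drift; mean stays 1

if __name__ == "__main__":
    r = discrete_gamma_rates(alpha=0.3, k=16)
    print(np.round(r, 4), "mean =", round(float(r.mean()), 6))
```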

The goal is to estimate the site-specific rates for all positions. Our site-specific rate estimate is defined as the expectation of r over its posterior rate distribution:

$$ E(r|data,T,\alpha ) \cong \sum\limits_{j=1}^k {r_j P(r_j |data,T,\alpha )} $$
(3)

This estimate was shown to be more accurate than the maximum likelihood (ML) estimate that assumes no prior rate distribution (Mayrose et al. 2004).
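
Once the per-category likelihoods P(data|r_j, T) are available for a site, Eqs. (2) and (3) amount to a normalization followed by a weighted average. The sketch below uses arbitrary placeholder likelihood values, purely to illustrate the arithmetic.

```python
import numpy as np

def posterior_rate_expectation(site_likelihoods, rates):
    """Eq. (2): posterior over k rate categories with equal (1/k) priors,
    followed by Eq. (3): the posterior-expected rate for the site."""
    lik = np.asarray(site_likelihoods, dtype=float)
    posterior = lik / lik.sum()                 # equal priors cancel out
    return float(np.dot(rates, posterior)), posterior

if __name__ == "__main__":
    # Illustrative values only: 4 categories and arbitrary per-category likelihoods.
    rates = np.array([0.1, 0.5, 1.0, 2.4])
    site_likelihoods = np.array([1e-12, 4e-11, 9e-11, 2e-11])
    r_hat, post = posterior_rate_expectation(site_likelihoods, rates)
    print("posterior:", np.round(post, 3), "expected rate:", round(r_hat, 3))
```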

Evolutionary-Rate Inference over the Entire Tree Space

Equation (3) can be read as “the expectation of r given that the tree topology is τ, the set of associated branch lengths is t, and the shape of the gamma distribution is α.” In the case where the combined state ω = {τ, t, α} is unknown, we can use the law of total probability to obtain a site-specific rate estimate over the whole tree space and over all α values:

$$ E(r|data)=\sum\limits_{i=1}^{C_S } {\int\limits_{t_i } {\int\limits_{\alpha} {E(r|data,\tau _i ,t_i ,\alpha )P(\tau _i ,t_i ,\alpha |data)dt_i d\alpha } } } $$
(4)

The symbol τ_i labels the ith tree topology, t_i is the set of branch lengths associated with this topology, and C_S is the number of possible topologies for a data set containing S sequences. According to Bayes’ law, the second term on the right-hand side of Eq. (4) can be expressed as

$$ P(\tau _i ,t_i ,\alpha |data)={{P(data|\tau _i ,t_i ,\alpha )P(\tau _i ,t_i ,\alpha )} \over {\sum\limits_{j=1}^{C_S } {\int\limits_{t_j } {\int\limits_{\alpha '} {P(data|\tau _j ,t_j ,\alpha ')P(\tau _j ,t_j ,\alpha ')dt_j d\alpha '} } } }} $$
(5)

Each one of the expressions within Eq. (5) can be readily computed. However, the enumeration over all possible tree topologies and, for each topology, the integration over all possible combinations of branch lengths and α values is intractable for realistic-sized problems. MCMC is therefore used to generate a large sample from the posterior probability distribution of states without the explicit computations of these sums and integrals.

MCMC

The basic Metropolis–Hastings MCMC algorithm (Metropolis et al. 1953; Hastings 1970; Gelman et al. 1995) follows a two-step process. First, a new state, ω*, is proposed according to some stochastic proposal mechanism. Second, the new state is either accepted or rejected according to a transition probability. If the proposed state, ω*, is accepted, it becomes the next state of the chain, ω^{n+1}. Otherwise the chain remains in its current state, so that ω^{n+1} = ω^n. The transition probability is defined as

$$ \eqalign{P(\omega ^* |\omega ^n ) &\equiv \min \left[ {1,\underbrace {{{P(\omega ^* |data)} \over {P(\omega ^n |data)}}}_{{\rm{Posterior\ ratio}}} \times \underbrace {{{f(\omega ^n |\omega ^* )} \over {f(\omega ^* |\omega ^n )}}}_{{\rm{Hastings\ ratio}}}} \right] \cr &= \min \left[ {1,\underbrace {{{P(data|\omega ^* )} \over {P(data|\omega ^n )}}}_{{\rm{Likelihood\ ratio}}} \times \underbrace {{{P(\omega ^* )} \over {P(\omega ^n )}}}_{{\rm{Prior\ ratio}}} \times \underbrace {{{f(\omega ^n |\omega ^* )} \over {f(\omega ^* |\omega ^n )}}}_{{\rm{Hastings\ ratio}}}} \right]} $$
(6)

f(ω^*|ω^n) is the probability of proposing the new state given the old one, and f(ω^n|ω^*) is the probability of the reverse move, which is not actually made. These two terms are calculated based on the specific proposal mechanism implemented, which defines how a new state is proposed given the current state of the chain. The advantage of the formulation in Eq. (6) is that the complex denominator in Eq. (5) cancels out. The formula reduces to three ratios, each of which can be readily calculated. Note that if the right-hand term in Eq. (6) is greater than 1.0, the move is always accepted. This reflects a general tendency of the chain to go “uphill” when possible and to go “downhill” only occasionally. The starting state of the chain, ω^0, is chosen randomly from the entire parameter space. The points sampled during an initial portion of the chain (called the burn-in) are discarded since they are still characteristic of the starting point and do not properly reflect the posterior distribution. The proportion of time any single state is visited after the burn-in stage is a valid approximation of its posterior probability.
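
The accept/reject rule of Eq. (6) can be written compactly in log space for numerical stability. In the sketch below, log_posterior and propose are placeholders for functions the caller would supply (tree likelihood plus priors, and one of the proposal mechanisms described later); it illustrates the Metropolis–Hastings step and is not the McRate implementation.

```python
import math
import random

def metropolis_hastings_step(state, log_posterior, propose):
    """One Metropolis-Hastings update (Eq. 6).

    state         -- current state omega^n (tree topology, branch lengths, alpha)
    log_posterior -- callable returning log P(data|omega) + log P(omega)
    propose       -- callable returning (omega*, log f(omega*|omega^n),
                                         log f(omega^n|omega*))
    """
    proposal, log_fwd, log_bwd = propose(state)
    log_ratio = (log_posterior(proposal) - log_posterior(state)   # likelihood * prior ratio
                 + log_bwd - log_fwd)                              # Hastings ratio
    accept_prob = math.exp(min(0.0, log_ratio))   # "uphill" moves are always accepted
    if random.random() < accept_prob:
        return proposal, True
    return state, False                           # reject: chain stays at omega^n
```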

For each state ω in the sample, a site-specific rate estimate is calculated (Eq. [3]). The result of the algorithm is a rate estimate (\(\hat{r}\)) averaged over all sampled states. This estimate can be expressed as

$$ \hat r={1 \over N}\sum\limits_{i=1}^N {E(r|\omega ^i ,data)} $$
(7)

where N is the number of sampled states. Note that this expression is different from the estimate presented in Eq. (4). In Eq. (4) a rate distribution over the entire tree space is first obtained, and only then is the expectation derived. However, the MCMC algorithm ensures that as N becomes large enough, \( \hat {r} \) converges to the expectation in Eq. (4).

MCMC Implementation

In each step of the Markov chain a new state is proposed according to a predefined proposal mechanism. We have implemented four different types of moves (described below). At each step we choose one of these moves randomly according to a predefined distribution. All move types are symmetrical, i.e., the probability of moving from state X to state Y is the same as the probability of the reverse move from Y to X. The Hastings ratio in Eq. (6) thus equals 1.0.

The nearest-neighbor interchange (NNI) proposal randomly selects an internal branch of the tree. It then randomly interchanges two of four “neighboring” subtrees, one from each end of the internal branch. The lengths of all branches are kept unchanged. The NNI proposal, or equivalent variations of it, is currently employed in all four MCMC computer programs available for phylogeny (Larget and Simon 1999; Huelsenbeck and Ronquist 2001; McGuire et al. 2001; Jow et al. 2002).

A second proposal mechanism changes the length of a randomly chosen branch according to a sliding window mechanism: a window of some fixed width, δ, is placed around the current length of the branch, x. δ is a tuning parameter. The proposed length, x*, is then chosen uniformly from the interval (x − δ, x + δ). If x* becomes negative an NNI move is employed, and the length of the proposed branch is set to |x*|. If the branch is an external branch, then the topology of the tree remains the same. This proposal mechanism mainly results in branch length changes but can also induce a local topology change via an NNI move (see also Jow et al. 2002).

The third proposal mechanism modifies all branch lengths of the tree simultaneously. For each branch a number u is randomly drawn from the interval (1, 1 + ε), where ε is a tuning parameter. For a branch of length d, a new length, d*, is proposed to be either (d × u) or (d / u) with equal probability. The fourth proposal mechanism modifies the gamma distribution parameter α to be either (α × u) or (α / u), where u is randomly drawn from the interval (1, 1 + ζ).
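
The sliding-window and multiplier proposals lend themselves to a compact sketch. The helper functions below are illustrative only; the NNI move is omitted because it operates on the tree data structure itself, and all names are assumptions rather than McRate internals.

```python
import random

def sliding_window_proposal(x: float, delta: float):
    """Propose a new branch length uniformly from (x - delta, x + delta).
    A negative proposal is reflected to |x*| (in the text this also triggers
    an NNI move for internal branches)."""
    x_star = random.uniform(x - delta, x + delta)
    reflected = x_star < 0.0
    return abs(x_star), reflected

def multiplier_proposal(value: float, epsilon: float) -> float:
    """Propose (value * u) or (value / u) with equal probability, where u is
    drawn uniformly from (1, 1 + epsilon). Used both for rescaling branch
    lengths and, with tuning parameter zeta, for updating alpha."""
    u = random.uniform(1.0, 1.0 + epsilon)
    return value * u if random.random() < 0.5 else value / u

if __name__ == "__main__":
    print(sliding_window_proposal(0.05, delta=0.1))
    print(multiplier_proposal(0.8, epsilon=0.1))
```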

The values of the tuning parameters (δ, ε, and ζ) need to be carefully chosen for an efficient MCMC algorithm to traverse the entire parameter space. As a rule of thumb, the acceptance rate should be between 20 and 60% to provide a good mixing of the data (Huelsenbeck 2000). In the present implementation, the starting value for each tuning parameter is 0.1. During the burn-in period, each tuning parameter is increased or decreased depending on the acceptance rate of the move it controls, so that the acceptance rate will be between 20 and 60%.
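
One way to realize this burn-in tuning is to scale each tuning parameter up or down according to the observed acceptance rate of the move it controls. The rule below is a schematic assumption, not necessarily the exact adjustment used in McRate.

```python
def adapt_tuning(param: float, accepted: int, proposed: int,
                 low: float = 0.20, high: float = 0.60,
                 factor: float = 1.1) -> float:
    """Adjust a tuning parameter (delta, epsilon, or zeta) during burn-in so
    that the acceptance rate of its move falls between 20% and 60%."""
    if proposed == 0:
        return param
    rate = accepted / proposed
    if rate > high:        # accepting too often: propose bolder moves
        return param * factor
    if rate < low:         # rejecting too often: propose more timid moves
        return param / factor
    return param

if __name__ == "__main__":
    print(adapt_tuning(0.1, accepted=90, proposed=100))   # widens the window
    print(adapt_tuning(0.1, accepted=5, proposed=100))    # narrows it
```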

A practical problem associated with MCMC is determining how many steps are necessary to obtain a good approximation of the posterior distribution. The most useful diagnostic is to run multiple independent chains, each with a different starting point (Huelsenbeck et al. 2002). If these chains converge to the same estimated rates, it is a strong indication that the chains have appropriately sampled the parameter space. Here convergence is declared when all pairwise correlation coefficients between the rates inferred by the different chains are higher than 0.99. The rates inferred by the independent chains are then averaged to produce the final rate estimates. A second diagnostic ensures that, at all sites, the estimated rates have reached their limiting values: we test whether all rate estimates are restricted to a small interval of size ε for more than M steps. In all runs conducted, ε and M were set to 0.01 and 800, respectively. These values appear to balance computation time limitations (which call for a large ε and a small M) against precision (which calls for a small ε and a large M). Combining the two tests, the run is halted when both diagnostic criteria are satisfied.
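
Both stopping criteria translate directly into simple array operations. The sketch below assumes that each chain stores its running per-site rate estimates as a (steps x sites) array; the function and variable names are illustrative, not taken from McRate.

```python
import numpy as np

def chains_agree(final_rates_per_chain, threshold: float = 0.99) -> bool:
    """First diagnostic: all pairwise correlation coefficients between the
    rate vectors inferred by the independent chains exceed the threshold."""
    r = np.corrcoef(np.asarray(final_rates_per_chain))
    off_diag = r[np.triu_indices_from(r, k=1)]
    return bool(np.all(off_diag > threshold))

def rates_stabilized(rate_history, eps: float = 0.01, m: int = 800) -> bool:
    """Second diagnostic: for every site, the running rate estimate has stayed
    within an interval of width eps over the last m recorded steps."""
    recent = np.asarray(rate_history)[-m:]          # shape: (steps, sites)
    spread = recent.max(axis=0) - recent.min(axis=0)
    return len(recent) >= m and bool(np.all(spread < eps))

# Usage sketch: halt the run once both criteria hold, e.g.
# if chains_agree([chain1_rates, chain2_rates, chain3_rates]) and \
#         all(rates_stabilized(h) for h in histories):
#     stop_run()
```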

Prior Probabilities

In order to calculate the transition probability between states (Eq. [6]) a prior distribution for ω must be specified. Since there is no a priori biological justification for supporting any particular prior, a simple factorized prior P(ω) = P(τ_i)P(t_i)P(α) was chosen (as in Jow et al. 2002). A discrete uniform prior was set over topologies such that P(τ_i) = 1/C_S. Continuous uniform priors were assigned to the branch lengths and to α. The interval of possible branch lengths was set to (0, 5), while the interval for α was set to (0, 10). This choice ensures that all reasonable values of the parameters are reachable.

The Computer Program

The MCMC rate-inference algorithm described here is implemented in the program McRate, written in C++. The program is available at http://www.tau.ac.il/∼talp/MCMC/McRate.html. The sole obligatory input to McRate is a multiple sequence alignment file. The program allows users to specify a number of optional parameters such as the burn-in period and number of chains.

Simulations

Simulations were used in order to compare the accuracy of the rates inferred by McRate and those inferred by an empirical Bayesian approach, in which inference is based on a single phylogenetic tree (i.e., rates are inferred using Eq. [3]). We refer to this single tree method as ST. Our simulation runs were conducted using two different schemes. The first simulation set tested the effect of different rate distributions, while the second set tested different model trees.

For the first set, we chose a model tree with 17 operational taxonomic units (OTUs) (Fig. 2a). This tree was chosen because it was shown to be a difficult case for phylogeny inference (Alfaro et al. 2003). The rate at each position was drawn from a specified rate distribution. Three different sets of rate distributions were examined: (1) a gamma distribution with α = 0.3 that represents a severe among-site rate variation, (2) a gamma distribution with α = 1.0 that represents a case of little among-site rate variation, and (3) an empirical distribution inferred by ST from a multiple sequence alignment of 57 potassium channel proteins (see Biological Results, below). This distribution can be considered a realistic one because it is based on a large number of homologs and because of the good quality of the alignment. In all runs, the simulated rates were scaled so that the average equals 1.0.

Figure 2

Model trees used for simulations. The scale in the upper left corner is indicative of branch length. The number of OTUs is (a) 17, (b) 7, and (c) 27.

In the second set of simulations, we used model trees with different numbers of OTUs. In addition to the 17-OTU model tree used in the first simulation set, we generated two trees with 7 and 27 OTUs (Figs. 2b and c). For this set of simulations, the rates were drawn from a gamma distribution with α = 0.3. In all runs, the simulated rates were scaled so that the average equals 1.0.

Site-specific rates were drawn from the given rate distribution and were assigned to each site. Protein sequences were then generated by simulating evolutionary changes along the branches of the given model tree. The simulation used the JTT model of amino acid replacement (Jones et al. 1992), in which each site evolves independently. For each run, 100 sites were generated in this manner. The generated sequences were given as input to both McRate and ST. ST requires for its computations an assumed phylogenetic tree and a given α parameter. Two different tree reconstruction algorithms were examined: (1) ST-ML, in which an ML tree was constructed using the Semphy program (Friedman et al. 2002) and (2) ST-NJ, in which the tree was constructed according to the neighbor-joining (NJ) algorithm (Saitou and Nei 1987) with pairwise distances estimated by ML. Branch lengths in the resulting tree were then optimized using ML. In both cases the α parameter was inferred from the data by maximizing P(data| α,T) using a 16-category discrete gamma distribution (Yang 1994). For each simulation condition studied (e.g., a 7-OTU tree with α = 0.3) a total of 30 identical and independent simulation runs were conducted.
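
The rate-drawing and scaling step of the simulations can be sketched as follows: draw 100 site rates from a gamma distribution (e.g., α = 0.3) and rescale them to an average of exactly 1.0. Simulating the amino acid replacements along the tree is omitted here because it requires the empirical JTT matrix; the code is an illustration, not the simulation software used in the study.

```python
import numpy as np

def simulate_site_rates(alpha, n_sites=100, seed=None):
    """Draw site-specific rates from a gamma(alpha) distribution with mean 1
    and rescale them so that their empirical average is exactly 1.0."""
    rng = np.random.default_rng(seed)
    rates = rng.gamma(shape=alpha, scale=1.0 / alpha, size=n_sites)
    return rates / rates.mean()

if __name__ == "__main__":
    r = simulate_site_rates(alpha=0.3, n_sites=100, seed=1)
    print(round(float(r.mean()), 6), round(float(r.min()), 4), round(float(r.max()), 4))
```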

In each simulation run, three vectors of rates were inferred: one vector by McRate and two by ST (ST-NJ and ST-ML). The accuracy of inference was analyzed by the sum-of-squares (SOS) distance between the simulated rates and the rates inferred by each method. The SOS distances obtained from McRate and ST are dependent because, for each run, inferences were performed based on the same simulated data. For each simulation condition tested, 30 SOS measures were obtained for each inference method. A two-sided Wilcoxon nonparametric test between two dependent samples (Sokal and Rohlf 1981) was then performed in order to check whether the inference techniques attain comparable accuracy. A nonparametric test was used to eliminate the assumption that the SOS measures are normally distributed. The null hypothesis is that the two methods produce equal results. Rejection of the null hypothesis indicates that the rates inferred with one of the methods are significantly more accurate than those inferred by the other.
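
The accuracy comparison reduces to a sum-of-squares distance per run and a paired two-sided Wilcoxon signed-rank test across the 30 runs. The sketch below uses SciPy's wilcoxon function on placeholder values; it illustrates the statistics only, not the actual simulation output.

```python
import numpy as np
from scipy.stats import wilcoxon

def sos(true_rates, inferred_rates) -> float:
    """Sum-of-squares distance between simulated and inferred site rates."""
    diff = np.asarray(true_rates) - np.asarray(inferred_rates)
    return float(np.sum(diff ** 2))

def compare_methods(sos_method_a, sos_method_b, alpha_level: float = 0.05):
    """Wilcoxon signed-rank test on the paired SOS measures from the 30 runs
    (the samples are dependent: both methods see the same simulated data).
    The test is two-sided by default."""
    stat, p_value = wilcoxon(sos_method_a, sos_method_b)
    return p_value, p_value < alpha_level

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder SOS values for 30 runs of two hypothetical methods.
    sos_a = rng.normal(10.0, 1.0, size=30)
    sos_b = sos_a + rng.normal(0.5, 0.3, size=30)
    print(compare_methods(sos_a, sos_b))
```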

Results

Simulation Results

A comparison between the inference accuracy of McRate and the two ST methods for different numbers of OTUs is shown in Table 1. In all cases McRate is found to be the most accurate method, while ST-NJ appears to be the least accurate. This difference is significant in all but one case (Table 1). The simulation shows that the accuracy of all three methods increases as the number of OTUs increases. This finding is expected since more data are available at each position for rate inference.

Table 1 Simulation results: Model trees with different numbers of OTUs

The simulation results obtained with different rate distributions showed a similar trend (Table 2). McRate is superior to both ST methods under all distributions, although this superiority is not always statistically significant. As Table 2 shows, the shape of the rate distribution influences the accuracy of rate inference. Mayrose et al. (2004) have recently shown that the accuracy of prediction decreases when the amount of among-site rate variation increases (small α values in the gamma distribution). Our results here show that McRate’s superiority over ST is more noticeable in these cases (Table 2). The difference between McRate and ST-ML is 0.23 and 0.04 when simulating with α = 0.3 and α = 1.0, respectively. Thus, when inference accuracy is less reliable, McRate’s superiority is more pronounced. To reach an overall conclusion regarding McRate’s advantage, we pooled the data from all simulation scenarios (Tables 1 and 2). McRate was found to be significantly superior to both ST methods (p < 0.00001) for this comprehensive data set.

Table 2 Simulation results: Different rate distributions

McRate integrates over both trees and α values. Another set of simulations was conducted in order to identify which of these factors contributes most to the accuracy. Three MCMC schemes were compared. (1) In McRate, the integration is over all parameters. (2) In McRate_Tree, the integration is over trees only, keeping the α parameter constant; the α used is the mean value estimated by the full MCMC integration over all parameters, i.e., the resulting α estimate of McRate above. (3) In McRate_Alpha, the integration is over α values only; the tree was inferred by ML and kept constant. Table 3 shows that the SOS scores of McRate and McRate_Tree are almost identical. It seems that in most cases averaging over topologies is the main effect responsible for the greater reliability of the rate estimates.

Table 3 Simulation results obtained using different MCMC schemes

Biological Results: The Potassium Channel

Potassium channels are tetrameric integral membrane proteins that form transmembrane aqueous pores through which K+ ions can flow. Potassium channels take part in many different cellular processes including cell volume regulation, hormone secretion, and electrical impulse formation in electrically excitable cells (MacKinnon 2003). The most fundamental role carried out by all K+ channels is to allow the rapid permeation of K+ ions while rejecting, with extreme efficiency, the smaller Na+ ions (or other potential competitors). The solved three-dimensional (3D) structures of a bacterial K+ channel (Doyle et al. 1998; Jiang et al. 2002) have clarified the mechanism of selective ion transfer across the membrane.

We used McRate to study the conservation pattern of the potassium channel protein family. Fifty-seven homologous sequences of the Streptomyces lividans potassium channel, for which the 3D structure is known (PDB ID: 1bl8 [Doyle et al. 1998]), were used in the analysis. The homologous sequences were obtained by a BLAST search (Altschul et al. 1997) against the SwissProt database (http://us.expasy.org/sprot/). A multiple sequence alignment of these homologs was built using CLUSTALW (Thompson et al. 1994). The alignment was given as input to the McRate program. Three independent chains, each with a burn-in period of 10,000 steps, were run until convergence was reached (see Methods). The conservation scores were then projected onto the 3D structure. For this projection, the estimated continuous rates are mapped into nine color bins, as in the ConSurf server (Glaser et al. 2003). The range of each bin varies so that each bin contains one-ninth of the sites. We define highly conserved sites as those that fall in bin 8 or 9. A total of 98 sites were visualized, corresponding to the length of the sequence in the PDB entry.
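
The projection step maps the continuous rates onto nine equal-count bins, with bins 8 and 9 holding the most conserved (slowest-evolving) sites. A minimal sketch of such a binning is given below; the exact bin orientation and tie handling are assumptions made for illustration, not the ConSurf or McRate code.

```python
import numpy as np

def rates_to_bins(rates, n_bins: int = 9) -> np.ndarray:
    """Map continuous site rates into n_bins bins of (nearly) equal size, so
    each bin holds about one-ninth of the sites. Slower rates (more conserved
    sites) receive higher bin numbers, so bins 8-9 are the highly conserved
    sites referred to in the text."""
    rates = np.asarray(rates, dtype=float)
    order = np.argsort(rates)                  # ascending: most conserved first
    bins = np.empty(len(rates), dtype=int)
    for rank, site in enumerate(order):
        # rank 0 (slowest site) -> bin n_bins; the fastest site -> bin 1
        bins[site] = n_bins - (rank * n_bins) // len(rates)
    return bins

if __name__ == "__main__":
    demo = np.random.default_rng(2).gamma(0.5, 2.0, size=98)
    b = rates_to_bins(demo)
    print(np.bincount(b)[1:])    # roughly 98/9 sites per bin
```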

The conservation pattern obtained by McRate correlates well with the known functional features that contribute to the channel’s high potassium selectivity and throughput. A well-conserved surface patch of residues (all in the most conserved bin) is found at the extracellular entryway (Fig. 3a). This patch functions as the selectivity filter of the channel. The backbone of this signature sequence forms a rigid opening into which K+ ions fit precisely, but in which the smaller Na+ ions would fit only loosely (Miller 2000). Not surprisingly, mutating these amino acids disrupts the channel’s ability to discriminate between K+ and Na+ ions (Heginbotham et al. 1994). A second conserved region is formed by residues of medium to high conservation levels (Fig. 3b). This region forms the inner vestibule, which is lined mostly by hydrophobic residues. This hydrophobic lining provides the diffusing K+ ion with a direct inert path through the membrane (Doyle et al. 1998). The medium conservation levels can be explained by the fact that these residues are constrained only to remain hydrophobic. McRate also detected some highly conserved sites other than those forming the two patches above. Notably, these conserved residues are known to be important in maintaining the function and structure of the channel (e.g., sites involved in interhelical contacts or in the allosteric mechanism of pore closing and opening).

Figure 3

The conservation pattern of the potassium channel as inferred by McRate. a The four subunits are viewed from the extracellular side. b Conservation scores are shown on one subunit only, oriented with the extracellular surface on top. Conservation scores are color-coded onto the van der Waals surface of the protein. The K+ ions are yellow.

In order to evaluate McRate’s potential advantage over existing methods, the conservation pattern of the potassium channel was studied with two additional methods. First, we used ST-ML—the second-best performing method in our simulations. Second, we estimated the rate at each site with the maximum parsimony (MP) score calculated on the ML tree. The MP method represents a very fast yet naïve approach for rate estimation. The conservation analysis obtained using ST-ML identified approximately the same conserved patches as McRate. The similarity between these two methods is also reflected in the high correlation between the estimated rates (r2 = 0.934). We note, however, that 35 of 98 sites were assigned to a different conservation bin with differences spanning up to three bins. Some of these differences were located in functionally important domains.

Results obtained with the MP method were substantially different from those calculated by either McRate or ST-ML. The main conserved patch was only partially detected. Moreover, the patch comprising the hydrophobic lining could not be identified because of its average conservation level. Additionally, only 16 sites fell into the same conservation bins as in the McRate analysis. The calculation time required by the three methods varied substantially: McRate’s analysis took about a day, whereas ST-ML and MP required only a few minutes (using a Pentium 4, 2.40 GHz, with 512 MB of RAM).

Discussion

The MCMC approach presented here has a number of advantages. It allows us to effectively integrate over all possible trees and model parameters. MCMC samples from the entire phylogeny space, rather than relying on a single best tree. Moreover, prior distributions are assumed for all parameters of the evolutionary model (e.g., the gamma shape parameter, α). The inference of evolutionary rates is then based on all possible values of the parameters in addition to all possible trees.

The simulation results indicated that the MCMC approach, as implemented in the computer program McRate, is superior to methods that rely on a single tree. McRate and the ST method utilize the same probabilistic approach for computing site-specific rates, i.e., the expectation over the posterior rate distribution. Therefore, McRate’s improved accuracy clearly arises from considering different evolutionary scenarios rather than from the particular rate computation method implemented. McRate’s advantage was verified for different model trees and different distributions of simulated rates. Our simulations revealed that the performance of ST is, at best, comparable to that of McRate under some scenarios. Our simulations further showed that better results are obtained when ST is based on the ML tree rather than on the NJ tree. This difference evidently arises from using a better tree-inference technique.

When presenting the first Bayesian rate-inference technique for DNA sequences, Yang and Wang (1995) found a very high correlation between the rates inferred from an ML tree and those obtained using a star-like tree. They subsequently argued that the prediction of evolutionary rates is tolerant to errors in phylogenetic tree reconstruction. This means that inferred rates would be highly similar no matter which tree is assumed. If this hypothesis holds, then averaging over many possible trees will have little effect on the predicted rates. However, their conclusion was based on 4-OTU trees only. Yang and Wang’s (1995) conclusion is compatible with our results obtained when 4-OTU trees were simulated (results not shown). In these cases the correlations between rates inferred by McRate and ST were extremely high (r > 0.99), meaning that the differences between the two methods are negligible. However, upon inclusion of additional taxa, our simulations showed that the topology has a substantial effect on the estimated rates.

McRate’s capabilities for predicting functionally important protein regions were demonstrated using the thoroughly studied potassium channel protein family. Both McRate and ST-ML successfully recovered the known functional regions; the difference is limited to specific sites that are assigned to different conservation bins. However, in light of McRate’s superiority in most simulation schemes, it is likely that its prediction regarding the potassium channel is more accurate. The MP analysis demonstrated that such a naïve method is too simplistic for real biological examples: even the most conserved area was only partly recovered. Indeed, poor performance was also observed when using the MP score in all simulation schemes (results not shown).

Bayesian methods in phylogeny were recently criticized for overestimating the Bayesian support for internal nodes as compared with the traditional bootstrap and jackknife techniques (Simmons et al. 2004; Suzuki et al. 2002). In this study, however, the Bayesian technique is used only to obtain a large set of plausible trees and not to produce a measure of support for one single best tree.

We expect that in cases where the phylogenetic tree is hard to recover (short sequences, many gapped positions, etc.), the differences between the MCMC and the single-tree approaches will intensify. In practice, McRate is time-consuming and should therefore be the tool of choice when there are indications that the inferred phylogenetic tree is unreliable.