Abstract
For high-throughput structural genomic and evolutionary bioinformatics approaches, there is a clear need for fast methods to evaluate substitutions structurally. Coarse-grained methods are both powerful and fast, and a coarse-grained approach to position the substituted side chains is presented. Through the application of a coarse-grained method, a speed-up on the single- residue replacement, of at least sevenfold is achieved compared with modern all-atom approaches. At the same time, this approach maintains a small median RMSD from the leading all-atom approach (as measured in coarse-grained space), and predicts the conformation of point mutants with similar accuracy and generates biologically realistic side chain angles. This method is also substantially more predictable in its run time, making it useful for high-throughput studies of protein structural evolution. To demonstrate the utility of this method, it has been implemented in a forward simulation of sequences threaded through the SH2 domains, with selective pressures to fold and bind specifically. The relative substitution rates across the protein structure and at the binding interface are reflective of those observed in SH2 domain evolution. The algorithm has been implemented in C++, with the source code and binaries (currently supported for Linux systems) freely available as SARA at http://www.wyomingbioinformatics.org/LiberlesGroup/SARA.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
Introduction
Protein folding is an important problem, relating the sequence to the structure and ultimately to molecular function. The concerted effect of functional proteins gives rise to a phenotype on the organismal level, ultimately providing a direct link between the genome sequence and the organismal phenotype. A number of problems in evolutionary genomics increasingly rely upon an understanding of the role of structure in constraining the sequence evolution, including, the evaluation of the likelihood of observing different amino acid transitions in phylogenetics and ancestral sequence reconstruction (DePristo et al. 2005; Bridgham et al. 2009; Tokuriki and Tawfik 2009; Huzurbazar et al. 2010; Kleinman et al. 2010), as well as in simulating evolution to understand both the role of structure in determining sequence evolution (Parisi and Echave 2001; Rastogi and Liberles 2005; Rastogi et al. 2006; Massey et al. 2008) and the role of evolutionary and population genetic parameters in determining the distribution of protein structures. One particular class of experiments involves, simulating the evolution of a sequence through a structure in a population with a constant or changing selection for binding specificity to other proteins present in the simulation.
Computational analysis that links population genetics and evolution to protein structure rapidly increases in computational complexity. Numerous approaches of varying intricacy have been applied, from computationally intensive ab initio folding using molecular dynamics (MD) simulation (Voelz et al. 2010) to a highly simplified secondary structure prediction (Madera et al. 2010). The homology modeling approach for finding the correct fold, exploits the structural data already available, taking a closely related structure and modifying it to optimally fit the sequence of interest (Sali and Blundell 1993). An important problem during this process is that of finding the best configuration of the novel side chains, given an existing backbone. It has previously been shown that, single substitutions are well modeled by simply optimizing the side chain position without backbone modification (Kellogg et al. 2010). This still presents a formidable challenge because of the size of the search space of possible conformations (an exhaustive enumeration of all the possible side chain configurations in a typical protein is not feasible) (Holm and Sander 1992). Choosing between the different configurations in turn requires an evaluation of thermodynamic or informational optimality (Christ et al. 2010). A wealth of different solutions to both the issues have been implemented, focused on solving the combinatorial problem (Desmet et al. 1992; Voigt et al. 2000; Kingsford et al. 2005), and optimizing the energetics (Liang and Grishin 2002; Potapov et al. 2010). The SCWRL method (Canutescu et al. 2003) adopts sophisticated heuristics to minimize the conformational search space and utilizes a complex scoring function, making it the preferred algorithm for solving the threading problem in an all-atom representation of the protein structure (Krivov et al. 2009).
In this era of structural and interdisciplinary genomics, fast methods are needed which allows simulation of folding of very large numbers of different proteins, over evolutionary timescales, and/or in populations of proteins. Research questions have moved beyond simply asking what structure a sequence folds into on single protein to whole genome scales, but also to address questions about the evolutionary dynamics of protein function and ultimately genome functional content. Under an all-atom representation of protein structure, each side chain has many degrees of freedom, and therefore, common metrics like all versus all inter-atomic distances quickly become prohibitively time consuming to calculate, as the number of residues grows. For example, an evolutionary simulation of 103 generations in a population of 103 involves an evaluation of 106 proteins with mutations, and this is intractable with any all-atom approach, but only involves a single fold under a single evolutionary selective regime. To address questions in a large scale, coarse-grained methods are needed that summarize the properties of individual atoms. Coarse graining allows the combination of several atoms into a single “bead” and can increase the speed of computation by several orders of magnitude (Tozzini 2005). The level of coarse graining used, depends on the exact properties being investigated, from 1-bead lattice models for folding kinetics (Shakhnovich et al. 1996) to 6-bead models allowing explicit representation of backbone hydrogen bonding(Favrin et al. 2002). At a residue-specific level, Levitt proposed a 2-bead model (Levitt and Warshel 1975), where the backbone is represented as a series of Cα beads and side chains are reduced to a single Cβ bead. Variations on this theme, for use in MD force fields have been explored by other groups (Mukherjee and Bagchi 2003; Khalili et al. 2004; Hills et al. 2010). One particularly intriguing application is, the simulation of protein evolution, where, a very large number of mutated sequences must be assessed rapidly for folding or binding energy. Thus the 2-bead representation has so far provided an excellent combination of speed and detail for this task (Parisi and Echave 2001; Rastogi et al. 2006; Kleinman et al. 2010). One major limitation of the current coarse-grained approaches is the modeling of single non-synonymous mutations leading to side chain replacement very rapidly. Most available replacement methods only solve the problem in an all-atom space and are optimized to replace many residues simultaneously, rather than one at a time.
In this article, the first side chain replacement method for 2-bead coarse-grained models of protein structure, SARA (Sidechain Angular Replacement Algorithm), was implemented and tested. This method substantially reduced the computational time required for such approaches as population-level studies of protein structure, structurally constrained phylogenetic tree space search, and in particular, the important new trajectory of evolutionary simulations of protein sequence and structure. Application of SARA to the forward simulation of protein sequences is demonstrated.
Methods
A coarse-grained approach has been developed, where an all-atom representation of a protein has been reduced to a 2-bead representation. In this 2-bead model of protein structure, each residue is represented by one Cα bead (Cα i ) and one side chain bead (R i ) (Fig. 1). The Cα bead is centered on the Cα atom in an all-atom representation and has a radius of 1.8 Å. For all residue types except glycine, the Cβ bead center is determined by taking the centroid of all the side chain atoms (including the Cα atom), and the bead is assigned a residue-dependent radius (see Mukherjee and Bagchi 2003 for further details of the parameterization). In this model, glycine is simply represented with a Cα bead. For non-terminal residues, we define θ− as the angle between Cα i−1, Cα i and R i ; θ+ as the angle between R i , Cα i and Cα i+1; θa as the angle between Cα i−1, Cα i and Cα i+1; α, β and γ as the angles of rotation around the x, y and z axes, respectively. The coordinate system is reset to center on Cα i for each replacement. The residue at the N-terminus (i = 1) lacks a θ− angle, the residue at the C-terminus (i = N) lacks a θ+ angle, and both lack a θa angle.
To replace any residue not on the N- or C-terminus, the Cβ bead is inserted with its equilibrium bond length and bead radius at the same angle θ− as the previous side chain. This will place it in a starting position mostly free of steric hinderance, although replacing very small side chains with very large ones can cause collisions. Replacing any non-glycine with a glycine is accomplished by removing the Cβ bead from the structure. If the previous side chain was a glycine, or if starting from only a protein backbone, the Cβ bead is placed at an angle that bisects θa (in the plane of Cα i−1–Cα i –Cα i+1) to minimize steric clashes with nearby Cα beads. N- and C-terminal residues do not have a defined angle θa, and replacement of glycines at those positions requires a slightly different approach. To replace N-terminal glycines, the novel Cβ bead is placed such that, the Cα i –R i vector is parallel to the Cα i –Cα i+1 vector but has the opposite direction. Similarly, C-terminal glycines are replaced such that, the novel Cα i –R i vector is parallel to the Cα i−1–Cα i vector but with the opposite direction.
To optimize the geometry of a structure with one or more replaced side chains, the Cβ beads which are allowed to move are determined. The user may specify a threshold value for the radial distance around the replaced side chains, and any Cβ bead within this distance is flagged as movable. The default threshold is 0 Å, which limits the movement solely to the replaced residues. If additional flexibility in the area around the replacement is desired (repacking all local residues may be prudent if several simultaneous changes are modeled), then the user can increase this threshold. The Metropolis–Hastings Markov Chain Monte Carlo (MCMC) algorithm is used (Metropolis et al. 1953; Hastings 1970) to find a locally optimum configuration of α, β, and γ rotation angles for each movable residue. In each step, all the three rotation angles for each residue are randomly perturbed in increments of whole degrees chosen from a discretized Gaussian distribution N(μ, σ), but limited to −90° or +90° from the initial states. The value of a global energy function E is then calculated. If the novel energy E k is smaller than the previous energy E k−1, the Markov chain moves to the new position in angle space. If E k is larger than E k−1, the move is made with probability \( P = e^{{(E_{k - 1} - E_{k} )/T}} \). The algorithm terminates after n steps or when E converges to 0. E is implemented as a slightly modified version of the Lennard–Jones approximation of vdW forces used in the force field of Mukherjee and Bagchi (2003):
where r, is the distance between beads i and j, d ij is the sum of their hard-sphere radii, ε ii is the self-interaction energy of the residue type at i, and εCα is the self-interaction energy of Cα beads (Mukherjee and Bagchi 2003). The total energy is the sum of these functions calculated for all movable residues (calculated against all residues, movable or not). The user can choose parameters μ, σ, T, and n to fine-tune the behavior of the algorithm (these parameters are described above). The option of using the simple linear repulsion term used in SCWRL 3.0 is also provided (Canutescu et al. 2003). Switching to a custom energy function is a straight forward programatic change that is described on the SARA website.
To assess the accuracy and speed of the coarse-grained method, three sets of residue replacements were performed: whole-protein, single-residue, and mutant prediction. For the whole-protein replacement, the starting point was the protein backbone and the native side chain for each residue was inserted. In the single-residue experiment, every residue in a protein was replaced one at a time with a random novel side chain. The mutant prediction experiment was performed by replacing one or several residues in a wild-type structure and measuring the difference between the predicted and known mutant structures. As a basis for comparison, these experiments were also performed with SCWRL 4.0 and its all-atom solution was converted to a 2-bead representation. For all experiments, the pre- and post-replacement θ+ and θ−, RMSD from both the native structure and the SCWRL solution, differences in θ+ and θ− between this solution and the SCWRL solution were measured with respect to the 2-bead geometry. The time taken to generate the solutions was also recorded. The single-residue experiment produced a uniform distribution of residue frequencies, which is not equivalent to the distribution of residue frequencies in the background data set of amino acids in protein structures. To correct for differences in amino acid distributions, amino acids were re-sampled, with replacement from the single-residue set, such that, the sampled residue counts were equal to those in the background. The re-sampling was repeated 10 times and all measured quantities were re-calculated to prevent sampling error. Re-sampling did not affect the conclusions qualitatively. Additionally, replacements involving alanine were excluded from the analysis, as this residue had no degrees of freedom with respect to rotation in an all-atom representation.
The two first tests were performed on a subset of 100 diverse protein structures from the SCOP database (v. 1.75) (Murzin et al. 1995). Following Summa and Levitt (2007), 25 structures of a similar length (between 132 and 168 residues) from each of the four major SCOP Classes, with each structure having a SPACI score (Brenner et al. 2000) of 0.4 or higher, were selected. Mutant prediction was performed on 34 proteins from the dataset of Sinha and Nussinov (2001). The selected mutants were experimental structures that only differed from the wild-type in one or several point mutations to side chains other than Gly. Both sets cover a wide range of folds and functions to provide a rigorous test for the algorithm. All structures used are listed in the Supplementary Material. For all measurements, the parameters μ = 1, σ = 1, T = 1, and n = 100 were used. The default parameters for SCWRL 4.0 were used, and the program was also limited to 10 s of runtime, eliminating replacements that took longer.
To evaluate the suitability of the method for use in an evolutionary context, simulations in the spirit of those performed by Rastogi et al. (2006) were conducted. In each of the 10 replicate simulations, a population of 1,000 sequences was evolved up to 50% divergence from the native protein sequence, at a mutation rate of 10−5 per bp per generation. The starting point was the native DNA sequence and structure of the SAP SH2 protein (Poy et al. 1999), which functions by binding a peptide ligand. In each generation, mutations were applied at the DNA level, and the resulting protein sequences were threaded through a 2-bead representation of the protein structure using SARA. Each novel sequence–structure combination was evaluated for its thermodynamic stability and its capacity to bind the SAP ligand using the energy function of Bastolla et al. (2001). Sequences that did not preserve at least native folding and binding energy were discarded. Additionally, a selective pressure to avoid binding of a deleterious ligand was applied in the same manner as in Liberles et al. (2011). The position-specific residue frequencies in the final generation were compared with the Pfam (Finn et al. 2009) Hidden Markov Model (HMM), that describes the overall features of the SH2 family, and the 20 most conserved (as measured by Shannon information) (Crooks et al. 2004) positions were mapped onto the SAP structure for both data sets.
The side chain replacement method was implemented in the C++ programing language, encapsulated in a collection of classes that allows easy integration with existing software. A Unix binary of the SARA program, and source code, can be found at http://www.wyomingbioinformatics.org/LiberlesGroup/SARA/.
Results and Discussion
In order to facilitate high-throughput threading for studies in structural genomics or evolutionary structural biology, a coarse- grained method for side chain replacement is needed that is faster than all-atom methods. In particular, this is designed to assess the effects of single-residue replacements (analogous to single non-synonymous substitutions) during simulated evolution studies on the scale of whole populations. Given the large number of replacements being performed, it is vital to increase the efficiency of computation. In addition to using of a coarse-grained representation of proteins (reducing the number of atoms), a simplified algorithm for optimization of the novel side chain geometry is also employed. To test the performance of this method, three experiments were performed, with whole-protein replacement and single-residue replacement. The first is a commonly used test, that involves the stripping of all the side chains from a protein backbone and then re-inserting them, while the second more directly addresses the problem at hand, by replacing each side chain sequentially with a randomly selected novel residue. A third evaluation, where wild-type and mutant proteins with solved structures exist was also performed. For comparison, the solutions provided by the current best and fast all-atom method, SCWRL 4.0, and the properties of the different residue types in the native configuration were also generated.
All comparisons that were performed used equivalent measures in the 2-bead space, the space that SARA is designed for. The θ angles in the 2-bead space combine several distinct quantities in an all-atom space, the Cα-Cβ bond angle with respect to the backbone, and all possible torsion angles within the side chain. For most side chains, multiple rotamers map onto very similar Cβ bead placements and θ angles. It is not always possible to unambiguously identify a unique all-atom rotamer, which corresponds to a particular combination of θ angles. However, this lack of one-to-one reverse mapping is irrelevant to the performance of the coarse-grained side chain replacement method in coarse-grained space.
Whole-Protein Replacement
After whole-protein replacement, Fig. 2 (solid line) shows the distribution of root-mean square deviation (RMSD) between SARA and the native conformation of the side chains. The predicted positions show good agreement with the native structure, where the median deviation is only 0.11 Å and 99% of the replaced side chains are within 1.0 Å of the native state. The individual residue types, C, I, L, M, S, T, V deviate less than the overall median, while D, E, F, H, K, N, P, Q, R, W, and Y deviate more (Table 1). V is unusually well positioned (within the 1st quantile of the distribution); F, R, W, and Y are unusually poorly positioned (above the third quantile). This behavior is consistent with the nature of the method, where small and/or hydrophobic residues are well-placed as they are packed with few steric clashes, and the hydrophobicity-based energy function acts strongest on these. Large, charged, and highly polar residues are more difficult for the opposite reason, and they tend to occur on the surface, where there is little restricting geometry to “force” a native placement. The use of a alternative complex energy function may improve this. Side chains containing large planar constrained or aromatic segments (F, P, W, and Y) are particularly, poorly modeled by a sphere and consequently become problematic in a 2-bead representation.
The performance of SCWRL when reduced to a coarse-grained approximation, closely replicates native structures, and that the distribution of RMSD values between the coarse-grained solution and the native state is extremely similar to that between the coarse-grained solution and SCWRL, as suggested by previously published analysis using SCWRL (Krivov et al. 2009). This property was exploited to benchmark single-residue replacements below.
Comparing the side chain angles produced by the coarse-grained method with the native state, the distributions of the difference were fairly broad but with much of the density concentrated near 0° (Fig. 3c, d). For θ− 22% of the residues were less than 1° off (23% for θ+), and the median difference was only 13° (11° for θ+). A 20° difference was set as the threshold for correct placement. Using this definition, the coarse-grained method had an accuracy of 68% for θ− and 70% for θ+. This can be compared with SCWRL’s accuracy of 83% and 90%, respectively.
Examining the results for individual residue types as above, θ− showed C, I, M, S, T, and V as more accurately determined, while D, E, F, H, K, L, N, P, Q, R, W, and Y were worse (for θ+ C, E, I, L, M, N, Q, T, and V were more accurately determined; D, F, H, K, P, R, S, W, and Y were worse) (Table 1). Comparing this with the RMSD results, the behavior is consistent.
The coarse-grained side chain replacement method is slower than SCWRL on the problem of whole-protein replacement as SARA is not tuned to searching space efficiently for large numbers of replacements. The algorithm is designed for evolutionary and population genetic purposes, where changes are typically sampled one or a few at a time.
Single-Residue Replacement
Single-residue replacement involves the complex task of proper repacking to maintain energetically favorable contacts. With a fully native complement of residues, there is an optimal packing of the side chains given a fixed backbone. Replacing the native side chain with a randomly selected one, particularly in the core, does not necessarily result in a well-packed local environment, especially if a smaller side chain is replaced with a larger one. However, with respect to computational complexity the problem is significantly easier. Since only a single residue needs to be adjusted, the number of possible clashes and solutions is dramatically decreased. Even with a limited SCWRL runtime, the coarse-grained method is on average 21× faster (minimally 7.6× and at best 42× faster). The computational time of coarse-grained substitution is also much more predictable, having a ratio of standard deviation to mean of 0.074 compared with a ratio of 0.29 for SCWRL. This is a significant improvement on previous study (the time distributions are different on a Wilcoxon rank-sum test with P < 1.0e−10). Naturally, the smaller number of interacting bodies under the 2-bead model also reduces the computational burden.
Since there are no native structures which can be used for comparison after the single random residue replacements, and since SCWRL provides highly native-like solutions (Canutescu et al. 2003; Krivov et al. 2009), the SCWRL solution was treated as the best possible one given the parameters of this test, and therefore the RMSD and angle difference between it and the coarse-grained solution were utilized as measures of accuracy. As seen in Fig. 2 (dashed line), this test is indeed more difficult than reconstructing the native configuration. The coarse-grained method achieved a median RMSD of 0.80 Å and 92% of the replacements deviated less than 2.0 Å.
On a per-residue basis, C, D, I, L, N, S, T, and V were placed better than the overall median, while E, F, H, K, M, P, Q, R, W, and Y were placed less well (Table 1). This is consistent with the whole-protein replacement results, and the switched categorization of D and M can be explained by their medians being quite close to the overall median in both the tests.
Concerning the angles of replaced residues, the coarse-grained approach is different from most others in that it does not rely on a PDB-based rotamer library, and side chain rotations are varied continuously rather than discretely. This could theoretically lead to replacements with non-biological side chain configurations. Therefore, the distribution of the θ− and θ+ angles in the non-modified background data set was compared with those produced by the coarse-grained method and SCWRL after a single replacement (Fig. 3a, b). Overall, the agreement between the background distribution and either replacement method was fairly close with peaks and valleys similarly located although with some differences in magnitude. θ+, which had fewer and broader peaks overall, showed closer agreement than θ−. The coarse-grained method showed consistently wider peaks in the distribution, which is expected because of its continuous nature. In contrast, the SCWRL solution followed the background angles more closely, likely as a consequence of the use of a backbone-dependent rotamer library.
In addition to re-sampling of the data to correct for differences in residue frequencies, the weighting bias can be ameliorated by comparing the angle distributions of individual residue types with that of the same residue type in the background data. Using the Wilcoxon rank-sum test and a Boniferroni-corrected P-value threshold of 0.00125, the residues were divided into those that have significantly different angle distributions (presumably more poorly replaced), and those that do not (presumably accurately replaced). Although it draws solutions from a rotamer library, SCWRL produces results indistinguishable from the background angle distributions only for residue types F, H, W, and Y for θ− (D, F, H, K, M, N, P, Q, R, W, and Y for θ+). Our method gives replaced angles indistinguishable from the background for residues C, D, E, F, I, K, M, Q, R, S, T, W, and Y for θ− (C, E, F, I, R, V, and W for θ+). This somewhat unexpected result can be explained by closer examination of the shape and width of the distributions. The background data typically shows a few distinct modes for each residue, corresponding to common rotamers. Both SCWRL and SARA roughly reproduce these modes (see e.g., Fig. 3a, b), but often with peaks slightly shifted (SCWRL) or flattened (SARA). Differing peak locations, even with a qualitatively similar overall shape, biases the statistical test toward finding the significant differences. Flattened peaks and wider distributions make distinguishing between such differences more difficult, biasing the test toward reporting high similarity. The random replacement protocol sometimes places a side chain in an unusual structural context that further confounds the comparison. Taking all this into account, it is simply noted that neither method produces predicted angles that are grossly different from those found in natural protein structures, thus ensuring that the predictions are better than random.
Looking at the difference in angles between the coarse-grained and SCWRL solutions, the overall distribution (Fig. 3c, d) is quite similar to the whole-replacement case. The median difference for θ− (14°) and θ+ (15°) are both somewhat higher. The increased difficulty of the problem is also evident in that fewer replacements which are within 1° of the correct solution (12% for θ− and 9% for θ+) and that the accuracy drops to 62% for θ− and 64% for θ+.
However, when the angle differences were examined on a per-residue basis, the overall decrease in accuracy was due to very poor placement of only a few residue types, and coarse graining is actually more accurate when replacing single-residues, the intended purpose of the method. The side chains of E, I, K, L, M, N, R, S, T, and V had lower than overall median Δθ− (E, I, K, L, P, Q, S, T, and V for Δθ+); C, D, F, H, P, Q, W, and Y were further away (C, D, F, H, M, N, R, W, and Y for Δθ+). This indicates that a few more residue types are now well-placed than in the whole-replacement case. If the difference between the single- and whole-replacement cases is determined for the same measure, residues D, F, H, S, W, and Y consistently improve; only C, I, M, and T consistently become worse (Table 1). It is worth noting that although I and T lost accuracy in single-residue replacement, they were still better than average. The problematic cases were much the same as for whole-replacement, where side chains that are very large and/or mostly planar, and ones that form hydrogen bonds or salt bridges are poorly placed.
Mutant Prediction
Although the single-residue replacement experiment does provide information on the performance of the SARA method on the exact task for which it was designed, it is hampered by relying on another prediction method for benchmarking. Conversely, the whole-protein replacement does employ known protein structures for comparison but does not yield data regarding mutations. Ideally, one would compare predicted and experimentally determined mutant structures to quantify the strengths and weaknesses of any side chain replacement algorithm. Unfortunately, in contrast to the extensive databases detailing such properties as changes in the free energy of folding (Kumar et al. 2006), the number of mutants with solved structures is comparatively tiny. Nevertheless, a small set of experimentally mutated proteins was selected from the dataset of Sinha and Nussinov (2001). These mutants were all previously structurally characterized and differ from the wild-type protein only by one or several point mutations (no insertions or deletions), presenting a precise picture of the structural effects of side chain replacement alone. While modestly sized (34 structures in all), the data set covers a range of molecular functions including multiple structures of lysozyme, myoglobin, aspartic proteinase, dihydrofolate reductase, and RNAse A, and barnase. The dataset contains structures from each of the major Classes in SCOP. The full list of structures can be found in the Supplementary Material. Both SARA and SCWRL were used to predict the position of the mutated side chain(s) from the wild-type structure, and the predictions were compared with the known mutant structure. The distributions of differences between the SARA predictions and known structures in position (RSMD) and angle (Δθ− and Δθ+) are shown in Figs. 2 and 3c/d, respectively. These data confirm that the method is accurate in positioning mutated side chains: the median RMSD is only 0.56 Å, and 76% of the predicted side chains deviate less than 2.0 Å from their true coordinates. Angles of the predicted side chains are also consistent with those seen in the above experiments: the median difference is 11° for θ− (11° for θ+), yielding an overall accuracy of 76% (71% for θ+). It is interesting to note that although SCWRL is also quite accurate on this task (median RMSD 0.61 Å), confirming the use of its predictions as a benchmark in the single-residue replacement test, the SARA result is not significantly worse on this particular data set (P = 0.31, hypothesis testing as above). A closer look at the nature of the data explains why. Many of the replacements are mutations to C, S or T, all residues are handled unusually well by the SARA method.
Evolutionary Simulation
As an example, application of the method to high-throughput studies of the evolution of protein structure, the results of a largescale sequence simulation experiment are presented. The native sequence of a SH2 domain protein was evolved under the restriction of maintaining folding stability and ligand-binding function, using SARA to thread mutant sequences through a 2-bead representation of the protein structure. The effects of >1,000,000 point mutations that were evaluated in less than a week of processing would have taken over 5 months on an equivalent system using SCWRL and an all-atom model. The resulting distribution of sequences is shown as a SeqLogo (Crooks et al. 2004) in Fig. 4 (left), along with the HMM that characterizes the SH2 family of proteins in the Pfam database (Finn et al. 2009). Some important family-specific patterns are maintained: notably, small hydrophobic residues around position 6, 20, 30, 44, and 85 (1, 15, 25, 39, and 75 in Pfam). By mapping the 20 most slowly evolving positions onto the structure, one can observe that the spatial distribution of conserved positions is qualitatively the same for the simulated data and the SH2 family (Fig. 4, right). These residues correspond to the hydrophobic protein core and represent key interactions that hold the secondary structure elements together. Proper side chain packing is crucial to stability in the protein core (Dill et al. 2008), and the SARA method produces a well-packed solution even after numerous sequential single replacements. Strong conservation of residues adjoining the binding interface (around position 50) can also be observed, reflecting the selective pressure to avoid steric clashes with the ligand. Since the SH2 family of proteins bind varying ligands in vivo, this protein-specific pattern is not visible in the Pfam sequence collection.
In conclusion, the first coarse-grained side chain replacement method applicable to real protein structures has been developed and tested. The method SARA is significantly faster than the current state-of-the-art algorithm (SCWRL) when performing single-residue replacements, and more predictable in its run time, but still retains an accuracy comparable with that of the most advanced fast all-atom method, when assessed in coarse-grained space. It is anticipated that the many fold increase in the speed of computation will facilitate high-throughput studies of protein structure evolution on the level of entire populations or genomes, as well as structurally constrained phylogenetic analysis. Furthermore, this method is expected to be highly useful in the emerging field of evolutionary molecular simulations, which requires exploration of the effects of millions of single mutations, along the lines shown with the SH2 domain. The full availability of the source code will allow users to integrate the method into their own experimental systems, and it will be an important component of the molecular evolutionary simulation system currently being developed.
References
Bastolla U, Farwer J, Knapp EW, Vendruscolo M (2001) How to guarantee optimal stability for most representative structures in the protein data bank. Proteins Struct Funct Genet 44:79–96
Brenner SE, Koehl P, Levitt M (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 28:254–256
Bridgham JT, Ortlund EA, Thornton JW (2009) An epistatic ratchet constrains the direction of glucocorticoid receptor evolution. Nature 461:515–519
Canutescu AA, Shelenkov AA, Dunbrack RL (2003) A graph-theory algorithm for rapid protein sidechain prediction. Protein Sci 12:2001–2014
Christ CD, Mark AE, van Gunsteren WF (2010) Basic ingredients of free energy calculations: a review. J Comput Chem 31:1569–1582
Crooks GE, Hon G, Chandonia J-M, Brenner SE (2004) WebLogo: a sequence logo generator. Genome Res 14:1188–1190
DePristo MA, Weinreich DM, Hartl DL (2005) Missense meanderings in sequence space: a biophysical view of protein evolution. Nat Rev Genet 6:678–687
Desmet J, Maeyer MD, Hazes B, Lasters I (1992) The dead-end elimination theorem and its use in protein side chain positioning. Nature 356:539–542
Dill KA, Ozkan SB, Shell MS, Weikl TR (2008) The protein folding problem. Annu Rev Biophys 37:289–316
Favrin G, Irbäck A, Wallin S (2002) Folding of a small helical protein using hydrogen bonds and hydrophobicity forces. Proteins 47:99–105
Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K et al (2009) The Pfam protein families database. Nucleic Acids Res 38:D211–D222
Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109
Hills RD, Lu L, Voth GA (2010) Multiscale coarse-graining of the protein energy landscape. PLoS Comput Biol 6:e1000827
Holm L, Sander C (1992) Fast and simple Monte Carlo algorithm for side chain optimization in proteins: application to model building by homology. Proteins 14:213–223
Huzurbazar S, Kolesov G, Massey SE, Harris KC, Churbanov A, Liberles DA (2010) Lineage-specific differences in the amino acid substitution process. J Mol Biol 396:1410–1421
Kellogg EH, Leaver-Fay A, Baker D (2010) Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins. Available at: http://www.ncbi.nlm.nih.gov/pubmed/21132773. Accessed December 9, 2010
Khalili M, Saunders JA, Liwo A, Ołdziej S, Scheraga HA (2004) A united residue force-field for calcium–protein interactions. Protein Sci 13:2725–2735
Kingsford CL, Chazelle B, Singh M (2005) Solving and analyzing side chain positioning problems using linear and integer programing. Bioinformatics 21:1028–1036
Kleinman CL, Rodrigue N, Lartillot N, Philippe H (2010) Statistical potentials for improved structurally constrained evolutionary models. Mol Biol Evol 27:1546–1560
Krivov GG, Shapovalov MV, Dunbrack RL (2009) Improved prediction of protein side chain conformations with SCWRL4. Proteins 77:778–795
Kumar MDS, Bava KA, Gromiha MM, Prabakaran P, Kitajima K, Uedaira H, Sarai A (2006) ProTherm and ProNIT: thermodynamic databases for proteins and protein–nucleic acid interactions. Nucleic Acids Res 34:D204–D206
Levitt M, Warshel A (1975) Computer simulation of protein folding. Nature 253:694–698
Liang S, Grishin NV (2002) Side chain modeling with an optimized scoring function. Protein Sci 11:322–331
Liberles DA, Tisdell MDM, Grahnen JA (2011) Binding constraints on the evolution of enzymes and signalling proteins: the important role of negative pleiotropy. Proc R Soc B: Biol Sci 278:1930–1935
Madera M, Calmus R, Thiltgen G, Karplus K, Gough J (2010) Improving protein secondary structure prediction using a simple k-mer model. Bioinformatics 26:596–602
Massey SE, Churbanov A, Rastogi S, Liberles DA (2008) Characterizing positive and negative selection and their phylogenetic effects. Gene 418:22–26
Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21:1087–1092
Mukherjee A, Bagchi B (2003) Correlation between rate of folding, energy landscape, and topology in the folding of a model protein HP-36. J Chem Phys 118:4733–4747
Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540
Parisi G, Echave J (2001) Structural constraints and emergence of sequence patterns in protein evolution. Mol Biol Evol 18:750–756
Potapov V, Cohen M, Inbar Y, Schreiber G (2010) Protein structure modelling and evaluation based on a 4-distance description of side chain interactions. BMC Bioinform 11:374
Poy F, Yaffe MB, Sayos J, Saxena K, Morra M, Sumegi J, Cantley LC, Terhorst C, Eck MJ (1999) Crystal structures of the XLP protein SAP reveal a class of SH2 domains with extended, phosphotyrosine-independent sequence recognition. Mol Cell 4:555–561
Rastogi S, Liberles DA (2005) Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol 5:28
Rastogi S, Reuter N, Liberles DA (2006) Evaluation of models for the evolution of protein sequences and functions under structural constraint. Biophys Chem 124:134–144
Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234:779–815
Shakhnovich E, Abkevich V, Ptitsyn O (1996) Conserved residues and the mechanism of protein folding. Nature 379:96–98
Sinha N, Nussinov R (2001) Point mutations and sequence variability in proteins: redistributions of preexisting populations. Proc Natl Acad Sci USA 98:3139–3144
Summa CM, Levitt M (2007) Near-native structure refinement using in vacuo energy minimization. Proc Natl Acad Sci USA 104:3177–3182
Tokuriki N, Tawfik DS (2009) Protein dynamism and evolvability. Science 324:203–207
Tozzini V (2005) Coarse-grained models for proteins. Curr Opin Struct Biol 15:144–150
Voelz VA, Bowman GR, Beauchamp K, Pande VS (2010) Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1–39). J Am Chem Soc 132:1526–1528
Voigt CA, Gordon DB, Mayo SL (2000) Trading accuracy for speed: a quantitative comparison of search algorithms in protein sequence design. J Mol Biol 299:789–803
Acknowledgments
This study was supported by an institutional NIH INBRE award to University of Wyoming (P20 RR016474). Jan Kubelka is supported by NSF CAREER award 0846140. David Liberles receives support from NSF award DBI-0743374.
Author information
Authors and Affiliations
Corresponding authors
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Grahnen, J.A., Kubelka, J. & Liberles, D.A. Fast Side Chain Replacement in Proteins Using a Coarse-Grained Approach for Evaluating the Effects of Mutation During Evolution. J Mol Evol 73, 23–33 (2011). https://doi.org/10.1007/s00239-011-9454-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00239-011-9454-3