Introduction

Protein folding is an important problem, relating the sequence to the structure and ultimately to molecular function. The concerted effect of functional proteins gives rise to a phenotype on the organismal level, ultimately providing a direct link between the genome sequence and the organismal phenotype. A number of problems in evolutionary genomics increasingly rely upon an understanding of the role of structure in constraining the sequence evolution, including, the evaluation of the likelihood of observing different amino acid transitions in phylogenetics and ancestral sequence reconstruction (DePristo et al. 2005; Bridgham et al. 2009; Tokuriki and Tawfik 2009; Huzurbazar et al. 2010; Kleinman et al. 2010), as well as in simulating evolution to understand both the role of structure in determining sequence evolution (Parisi and Echave 2001; Rastogi and Liberles 2005; Rastogi et al. 2006; Massey et al. 2008) and the role of evolutionary and population genetic parameters in determining the distribution of protein structures. One particular class of experiments involves, simulating the evolution of a sequence through a structure in a population with a constant or changing selection for binding specificity to other proteins present in the simulation.

Computational analysis that links population genetics and evolution to protein structure rapidly increases in computational complexity. Numerous approaches of varying intricacy have been applied, from computationally intensive ab initio folding using molecular dynamics (MD) simulation (Voelz et al. 2010) to a highly simplified secondary structure prediction (Madera et al. 2010). The homology modeling approach for finding the correct fold, exploits the structural data already available, taking a closely related structure and modifying it to optimally fit the sequence of interest (Sali and Blundell 1993). An important problem during this process is that of finding the best configuration of the novel side chains, given an existing backbone. It has previously been shown that, single substitutions are well modeled by simply optimizing the side chain position without backbone modification (Kellogg et al. 2010). This still presents a formidable challenge because of the size of the search space of possible conformations (an exhaustive enumeration of all the possible side chain configurations in a typical protein is not feasible) (Holm and Sander 1992). Choosing between the different configurations in turn requires an evaluation of thermodynamic or informational optimality (Christ et al. 2010). A wealth of different solutions to both the issues have been implemented, focused on solving the combinatorial problem (Desmet et al. 1992; Voigt et al. 2000; Kingsford et al. 2005), and optimizing the energetics (Liang and Grishin 2002; Potapov et al. 2010). The SCWRL method (Canutescu et al. 2003) adopts sophisticated heuristics to minimize the conformational search space and utilizes a complex scoring function, making it the preferred algorithm for solving the threading problem in an all-atom representation of the protein structure (Krivov et al. 2009).

In this era of structural and interdisciplinary genomics, fast methods are needed which allows simulation of folding of very large numbers of different proteins, over evolutionary timescales, and/or in populations of proteins. Research questions have moved beyond simply asking what structure a sequence folds into on single protein to whole genome scales, but also to address questions about the evolutionary dynamics of protein function and ultimately genome functional content. Under an all-atom representation of protein structure, each side chain has many degrees of freedom, and therefore, common metrics like all versus all inter-atomic distances quickly become prohibitively time consuming to calculate, as the number of residues grows. For example, an evolutionary simulation of 103 generations in a population of 103 involves an evaluation of 106 proteins with mutations, and this is intractable with any all-atom approach, but only involves a single fold under a single evolutionary selective regime. To address questions in a large scale, coarse-grained methods are needed that summarize the properties of individual atoms. Coarse graining allows the combination of several atoms into a single “bead” and can increase the speed of computation by several orders of magnitude (Tozzini 2005). The level of coarse graining used, depends on the exact properties being investigated, from 1-bead lattice models for folding kinetics (Shakhnovich et al. 1996) to 6-bead models allowing explicit representation of backbone hydrogen bonding(Favrin et al. 2002). At a residue-specific level, Levitt proposed a 2-bead model (Levitt and Warshel 1975), where the backbone is represented as a series of Cα beads and side chains are reduced to a single Cβ bead. Variations on this theme, for use in MD force fields have been explored by other groups (Mukherjee and Bagchi 2003; Khalili et al. 2004; Hills et al. 2010). One particularly intriguing application is, the simulation of protein evolution, where, a very large number of mutated sequences must be assessed rapidly for folding or binding energy. Thus the 2-bead representation has so far provided an excellent combination of speed and detail for this task (Parisi and Echave 2001; Rastogi et al. 2006; Kleinman et al. 2010). One major limitation of the current coarse-grained approaches is the modeling of single non-synonymous mutations leading to side chain replacement very rapidly. Most available replacement methods only solve the problem in an all-atom space and are optimized to replace many residues simultaneously, rather than one at a time.

In this article, the first side chain replacement method for 2-bead coarse-grained models of protein structure, SARA (Sidechain Angular Replacement Algorithm), was implemented and tested. This method substantially reduced the computational time required for such approaches as population-level studies of protein structure, structurally constrained phylogenetic tree space search, and in particular, the important new trajectory of evolutionary simulations of protein sequence and structure. Application of SARA to the forward simulation of protein sequences is demonstrated.

Methods

A coarse-grained approach has been developed, where an all-atom representation of a protein has been reduced to a 2-bead representation. In this 2-bead model of protein structure, each residue is represented by one Cα bead (Cα i ) and one side chain bead (R i ) (Fig. 1). The Cα bead is centered on the Cα atom in an all-atom representation and has a radius of 1.8 Å. For all residue types except glycine, the Cβ bead center is determined by taking the centroid of all the side chain atoms (including the Cα atom), and the bead is assigned a residue-dependent radius (see Mukherjee and Bagchi 2003 for further details of the parameterization). In this model, glycine is simply represented with a Cα bead. For non-terminal residues, we define θ− as the angle between Cα i−1, Cα i and R i ; θ+ as the angle between R i , Cα i and Cα i+1; θa as the angle between Cα i−1, Cα i and Cα i+1; α, β and γ as the angles of rotation around the x, y and z axes, respectively. The coordinate system is reset to center on Cα i for each replacement. The residue at the N-terminus (i = 1) lacks a θ− angle, the residue at the C-terminus (i = N) lacks a θ+ angle, and both lack a θa angle.

Fig. 1
figure 1

The 2-bead model of protein structure is presented. Cα beads (Cα i ) are gray, Cβ beads (R i ) are black, and virtual bonds are gray lines; planar angles θa, θ−, and θ+ are shown with arcs; rotation angles α, β, and γ are shown with arced arrows. The coordinate axes x, y, and z are shown with straight arrows

To replace any residue not on the N- or C-terminus, the Cβ bead is inserted with its equilibrium bond length and bead radius at the same angle θ− as the previous side chain. This will place it in a starting position mostly free of steric hinderance, although replacing very small side chains with very large ones can cause collisions. Replacing any non-glycine with a glycine is accomplished by removing the Cβ bead from the structure. If the previous side chain was a glycine, or if starting from only a protein backbone, the Cβ bead is placed at an angle that bisects θa (in the plane of Cα i−1–Cα i –Cα i+1) to minimize steric clashes with nearby Cα beads. N- and C-terminal residues do not have a defined angle θa, and replacement of glycines at those positions requires a slightly different approach. To replace N-terminal glycines, the novel Cβ bead is placed such that, the Cα i R i vector is parallel to the Cα i –Cα i+1 vector but has the opposite direction. Similarly, C-terminal glycines are replaced such that, the novel Cα i R i vector is parallel to the Cα i−1–Cα i vector but with the opposite direction.

To optimize the geometry of a structure with one or more replaced side chains, the Cβ beads which are allowed to move are determined. The user may specify a threshold value for the radial distance around the replaced side chains, and any Cβ bead within this distance is flagged as movable. The default threshold is 0 Å, which limits the movement solely to the replaced residues. If additional flexibility in the area around the replacement is desired (repacking all local residues may be prudent if several simultaneous changes are modeled), then the user can increase this threshold. The Metropolis–Hastings Markov Chain Monte Carlo (MCMC) algorithm is used (Metropolis et al. 1953; Hastings 1970) to find a locally optimum configuration of α, β, and γ rotation angles for each movable residue. In each step, all the three rotation angles for each residue are randomly perturbed in increments of whole degrees chosen from a discretized Gaussian distribution N(μ, σ), but limited to −90° or +90° from the initial states. The value of a global energy function E is then calculated. If the novel energy E k is smaller than the previous energy E k−1, the Markov chain moves to the new position in angle space. If E k is larger than E k−1, the move is made with probability \( P = e^{{(E_{k - 1} - E_{k} )/T}} \). The algorithm terminates after n steps or when E converges to 0. E is implemented as a slightly modified version of the Lennard–Jones approximation of vdW forces used in the force field of Mukherjee and Bagchi (2003):

$$ E = \sum\limits_{{R_{i} ,R_{j} }} {4\sqrt {\varepsilon_{ii} \varepsilon_{jj} } \left( {\left( {\frac{{d_{ij} }}{r}} \right)^{12} - \left( {\frac{{d_{ij} }}{r}} \right)^{6} + \frac{1}{4}} \right)} + \sum\limits_{{R_{i} ,{\text{C}}\alpha_{j} }} {4\sqrt {\varepsilon_{ii} \varepsilon_{{{\text{C}}\alpha }} } \left( {\left( {\frac{{d_{ij} }}{r}} \right)^{12} - \left( {\frac{{d_{ij} }}{r}} \right)^{6} + \frac{1}{4}} \right)} $$

where r, is the distance between beads i and j, d ij is the sum of their hard-sphere radii, ε ii is the self-interaction energy of the residue type at i, and ε is the self-interaction energy of Cα beads (Mukherjee and Bagchi 2003). The total energy is the sum of these functions calculated for all movable residues (calculated against all residues, movable or not). The user can choose parameters μ, σ, T, and n to fine-tune the behavior of the algorithm (these parameters are described above). The option of using the simple linear repulsion term used in SCWRL 3.0 is also provided (Canutescu et al. 2003). Switching to a custom energy function is a straight forward programatic change that is described on the SARA website.

To assess the accuracy and speed of the coarse-grained method, three sets of residue replacements were performed: whole-protein, single-residue, and mutant prediction. For the whole-protein replacement, the starting point was the protein backbone and the native side chain for each residue was inserted. In the single-residue experiment, every residue in a protein was replaced one at a time with a random novel side chain. The mutant prediction experiment was performed by replacing one or several residues in a wild-type structure and measuring the difference between the predicted and known mutant structures. As a basis for comparison, these experiments were also performed with SCWRL 4.0 and its all-atom solution was converted to a 2-bead representation. For all experiments, the pre- and post-replacement θ+ and θ−, RMSD from both the native structure and the SCWRL solution, differences in θ+ and θ− between this solution and the SCWRL solution were measured with respect to the 2-bead geometry. The time taken to generate the solutions was also recorded. The single-residue experiment produced a uniform distribution of residue frequencies, which is not equivalent to the distribution of residue frequencies in the background data set of amino acids in protein structures. To correct for differences in amino acid distributions, amino acids were re-sampled, with replacement from the single-residue set, such that, the sampled residue counts were equal to those in the background. The re-sampling was repeated 10 times and all measured quantities were re-calculated to prevent sampling error. Re-sampling did not affect the conclusions qualitatively. Additionally, replacements involving alanine were excluded from the analysis, as this residue had no degrees of freedom with respect to rotation in an all-atom representation.

The two first tests were performed on a subset of 100 diverse protein structures from the SCOP database (v. 1.75) (Murzin et al. 1995). Following Summa and Levitt (2007), 25 structures of a similar length (between 132 and 168 residues) from each of the four major SCOP Classes, with each structure having a SPACI score (Brenner et al. 2000) of 0.4 or higher, were selected. Mutant prediction was performed on 34 proteins from the dataset of Sinha and Nussinov (2001). The selected mutants were experimental structures that only differed from the wild-type in one or several point mutations to side chains other than Gly. Both sets cover a wide range of folds and functions to provide a rigorous test for the algorithm. All structures used are listed in the Supplementary Material. For all measurements, the parameters μ = 1, σ = 1, T = 1, and n = 100 were used. The default parameters for SCWRL 4.0 were used, and the program was also limited to 10 s of runtime, eliminating replacements that took longer.

To evaluate the suitability of the method for use in an evolutionary context, simulations in the spirit of those performed by Rastogi et al. (2006) were conducted. In each of the 10 replicate simulations, a population of 1,000 sequences was evolved up to 50% divergence from the native protein sequence, at a mutation rate of 10−5 per bp per generation. The starting point was the native DNA sequence and structure of the SAP SH2 protein (Poy et al. 1999), which functions by binding a peptide ligand. In each generation, mutations were applied at the DNA level, and the resulting protein sequences were threaded through a 2-bead representation of the protein structure using SARA. Each novel sequence–structure combination was evaluated for its thermodynamic stability and its capacity to bind the SAP ligand using the energy function of Bastolla et al. (2001). Sequences that did not preserve at least native folding and binding energy were discarded. Additionally, a selective pressure to avoid binding of a deleterious ligand was applied in the same manner as in Liberles et al. (2011). The position-specific residue frequencies in the final generation were compared with the Pfam (Finn et al. 2009) Hidden Markov Model (HMM), that describes the overall features of the SH2 family, and the 20 most conserved (as measured by Shannon information) (Crooks et al. 2004) positions were mapped onto the SAP structure for both data sets.

The side chain replacement method was implemented in the C++ programing language, encapsulated in a collection of classes that allows easy integration with existing software. A Unix binary of the SARA program, and source code, can be found at http://www.wyomingbioinformatics.org/LiberlesGroup/SARA/.

Results and Discussion

In order to facilitate high-throughput threading for studies in structural genomics or evolutionary structural biology, a coarse- grained method for side chain replacement is needed that is faster than all-atom methods. In particular, this is designed to assess the effects of single-residue replacements (analogous to single non-synonymous substitutions) during simulated evolution studies on the scale of whole populations. Given the large number of replacements being performed, it is vital to increase the efficiency of computation. In addition to using of a coarse-grained representation of proteins (reducing the number of atoms), a simplified algorithm for optimization of the novel side chain geometry is also employed. To test the performance of this method, three experiments were performed, with whole-protein replacement and single-residue replacement. The first is a commonly used test, that involves the stripping of all the side chains from a protein backbone and then re-inserting them, while the second more directly addresses the problem at hand, by replacing each side chain sequentially with a randomly selected novel residue. A third evaluation, where wild-type and mutant proteins with solved structures exist was also performed. For comparison, the solutions provided by the current best and fast all-atom method, SCWRL 4.0, and the properties of the different residue types in the native configuration were also generated.

All comparisons that were performed used equivalent measures in the 2-bead space, the space that SARA is designed for. The θ angles in the 2-bead space combine several distinct quantities in an all-atom space, the Cα-Cβ bond angle with respect to the backbone, and all possible torsion angles within the side chain. For most side chains, multiple rotamers map onto very similar Cβ bead placements and θ angles. It is not always possible to unambiguously identify a unique all-atom rotamer, which corresponds to a particular combination of θ angles. However, this lack of one-to-one reverse mapping is irrelevant to the performance of the coarse-grained side chain replacement method in coarse-grained space.

Whole-Protein Replacement

After whole-protein replacement, Fig. 2 (solid line) shows the distribution of root-mean square deviation (RMSD) between SARA and the native conformation of the side chains. The predicted positions show good agreement with the native structure, where the median deviation is only 0.11 Å and 99% of the replaced side chains are within 1.0 Å of the native state. The individual residue types, C, I, L, M, S, T, V deviate less than the overall median, while D, E, F, H, K, N, P, Q, R, W, and Y deviate more (Table 1). V is unusually well positioned (within the 1st quantile of the distribution); F, R, W, and Y are unusually poorly positioned (above the third quantile). This behavior is consistent with the nature of the method, where small and/or hydrophobic residues are well-placed as they are packed with few steric clashes, and the hydrophobicity-based energy function acts strongest on these. Large, charged, and highly polar residues are more difficult for the opposite reason, and they tend to occur on the surface, where there is little restricting geometry to “force” a native placement. The use of a alternative complex energy function may improve this. Side chains containing large planar constrained or aromatic segments (F, P, W, and Y) are particularly, poorly modeled by a sphere and consequently become problematic in a 2-bead representation.

Fig. 2
figure 2

Distributions of root-mean square deviation (RMSD) between the SARA solution and the native state after whole-protein replacement (solid line), the SARA solution, and the SCWRL solution after single-residue replacement (dashed line), and the SARA solution and the known mutant crystal structure (dotted line), are shown

Table 1 Characteristics of individual residue types after whole-protein and single-residue replacement are shown

The performance of SCWRL when reduced to a coarse-grained approximation, closely replicates native structures, and that the distribution of RMSD values between the coarse-grained solution and the native state is extremely similar to that between the coarse-grained solution and SCWRL, as suggested by previously published analysis using SCWRL (Krivov et al. 2009). This property was exploited to benchmark single-residue replacements below.

Comparing the side chain angles produced by the coarse-grained method with the native state, the distributions of the difference were fairly broad but with much of the density concentrated near 0° (Fig. 3c, d). For θ− 22% of the residues were less than 1° off (23% for θ+), and the median difference was only 13° (11° for θ+). A 20° difference was set as the threshold for correct placement. Using this definition, the coarse-grained method had an accuracy of 68% for θ− and 70% for θ+. This can be compared with SCWRL’s accuracy of 83% and 90%, respectively.

Fig. 3
figure 3

Distributions of side chain angles after replacement are shown. a θ− side chain angles in the background set (solid line), the SCWRL solution after single-residue replacement (dashed line), and the SARA solution after single-residue replacement (dotted line) are shown. b θ+ side chain angles in the background set (solid line), the SCWRL solution after single-residue replacement (dashed line), and the SARA solution after single-residue replacement (dotted line) are shown. c The differences in θ− between the native state and the SARA solution after whole-protein replacement (solid line), between the SARA solution and the SCWRL solution after single-residue replacement (dashed line), and between the SARA solution and the known mutant crystal structure (dotted line), are shown. d The differences in θ+ between the native state and the SARA solution after whole-protein replacement (solid line), between the SARA solution and the SCWRL solution after single-residue replacement (dashed line), and between the SARA solution and the known mutant crystal structure (dotted line), are shown

Examining the results for individual residue types as above, θ− showed C, I, M, S, T, and V as more accurately determined, while D, E, F, H, K, L, N, P, Q, R, W, and Y were worse (for θ+ C, E, I, L, M, N, Q, T, and V were more accurately determined; D, F, H, K, P, R, S, W, and Y were worse) (Table 1). Comparing this with the RMSD results, the behavior is consistent.

The coarse-grained side chain replacement method is slower than SCWRL on the problem of whole-protein replacement as SARA is not tuned to searching space efficiently for large numbers of replacements. The algorithm is designed for evolutionary and population genetic purposes, where changes are typically sampled one or a few at a time.

Single-Residue Replacement

Single-residue replacement involves the complex task of proper repacking to maintain energetically favorable contacts. With a fully native complement of residues, there is an optimal packing of the side chains given a fixed backbone. Replacing the native side chain with a randomly selected one, particularly in the core, does not necessarily result in a well-packed local environment, especially if a smaller side chain is replaced with a larger one. However, with respect to computational complexity the problem is significantly easier. Since only a single residue needs to be adjusted, the number of possible clashes and solutions is dramatically decreased. Even with a limited SCWRL runtime, the coarse-grained method is on average 21× faster (minimally 7.6× and at best 42× faster). The computational time of coarse-grained substitution is also much more predictable, having a ratio of standard deviation to mean of 0.074 compared with a ratio of 0.29 for SCWRL. This is a significant improvement on previous study (the time distributions are different on a Wilcoxon rank-sum test with P < 1.0e−10). Naturally, the smaller number of interacting bodies under the 2-bead model also reduces the computational burden.

Since there are no native structures which can be used for comparison after the single random residue replacements, and since SCWRL provides highly native-like solutions (Canutescu et al. 2003; Krivov et al. 2009), the SCWRL solution was treated as the best possible one given the parameters of this test, and therefore the RMSD and angle difference between it and the coarse-grained solution were utilized as measures of accuracy. As seen in Fig. 2 (dashed line), this test is indeed more difficult than reconstructing the native configuration. The coarse-grained method achieved a median RMSD of 0.80 Å and 92% of the replacements deviated less than 2.0 Å.

On a per-residue basis, C, D, I, L, N, S, T, and V were placed better than the overall median, while E, F, H, K, M, P, Q, R, W, and Y were placed less well (Table 1). This is consistent with the whole-protein replacement results, and the switched categorization of D and M can be explained by their medians being quite close to the overall median in both the tests.

Concerning the angles of replaced residues, the coarse-grained approach is different from most others in that it does not rely on a PDB-based rotamer library, and side chain rotations are varied continuously rather than discretely. This could theoretically lead to replacements with non-biological side chain configurations. Therefore, the distribution of the θ− and θ+ angles in the non-modified background data set was compared with those produced by the coarse-grained method and SCWRL after a single replacement (Fig. 3a, b). Overall, the agreement between the background distribution and either replacement method was fairly close with peaks and valleys similarly located although with some differences in magnitude. θ+, which had fewer and broader peaks overall, showed closer agreement than θ−. The coarse-grained method showed consistently wider peaks in the distribution, which is expected because of its continuous nature. In contrast, the SCWRL solution followed the background angles more closely, likely as a consequence of the use of a backbone-dependent rotamer library.

In addition to re-sampling of the data to correct for differences in residue frequencies, the weighting bias can be ameliorated by comparing the angle distributions of individual residue types with that of the same residue type in the background data. Using the Wilcoxon rank-sum test and a Boniferroni-corrected P-value threshold of 0.00125, the residues were divided into those that have significantly different angle distributions (presumably more poorly replaced), and those that do not (presumably accurately replaced). Although it draws solutions from a rotamer library, SCWRL produces results indistinguishable from the background angle distributions only for residue types F, H, W, and Y for θ− (D, F, H, K, M, N, P, Q, R, W, and Y for θ+). Our method gives replaced angles indistinguishable from the background for residues C, D, E, F, I, K, M, Q, R, S, T, W, and Y for θ− (C, E, F, I, R, V, and W for θ+). This somewhat unexpected result can be explained by closer examination of the shape and width of the distributions. The background data typically shows a few distinct modes for each residue, corresponding to common rotamers. Both SCWRL and SARA roughly reproduce these modes (see e.g., Fig. 3a, b), but often with peaks slightly shifted (SCWRL) or flattened (SARA). Differing peak locations, even with a qualitatively similar overall shape, biases the statistical test toward finding the significant differences. Flattened peaks and wider distributions make distinguishing between such differences more difficult, biasing the test toward reporting high similarity. The random replacement protocol sometimes places a side chain in an unusual structural context that further confounds the comparison. Taking all this into account, it is simply noted that neither method produces predicted angles that are grossly different from those found in natural protein structures, thus ensuring that the predictions are better than random.

Looking at the difference in angles between the coarse-grained and SCWRL solutions, the overall distribution (Fig. 3c, d) is quite similar to the whole-replacement case. The median difference for θ− (14°) and θ+ (15°) are both somewhat higher. The increased difficulty of the problem is also evident in that fewer replacements which are within 1° of the correct solution (12% for θ− and 9% for θ+) and that the accuracy drops to 62% for θ− and 64% for θ+.

However, when the angle differences were examined on a per-residue basis, the overall decrease in accuracy was due to very poor placement of only a few residue types, and coarse graining is actually more accurate when replacing single-residues, the intended purpose of the method. The side chains of E, I, K, L, M, N, R, S, T, and V had lower than overall median Δθ− (E, I, K, L, P, Q, S, T, and V for Δθ+); C, D, F, H, P, Q, W, and Y were further away (C, D, F, H, M, N, R, W, and Y for Δθ+). This indicates that a few more residue types are now well-placed than in the whole-replacement case. If the difference between the single- and whole-replacement cases is determined for the same measure, residues D, F, H, S, W, and Y consistently improve; only C, I, M, and T consistently become worse (Table 1). It is worth noting that although I and T lost accuracy in single-residue replacement, they were still better than average. The problematic cases were much the same as for whole-replacement, where side chains that are very large and/or mostly planar, and ones that form hydrogen bonds or salt bridges are poorly placed.

Mutant Prediction

Although the single-residue replacement experiment does provide information on the performance of the SARA method on the exact task for which it was designed, it is hampered by relying on another prediction method for benchmarking. Conversely, the whole-protein replacement does employ known protein structures for comparison but does not yield data regarding mutations. Ideally, one would compare predicted and experimentally determined mutant structures to quantify the strengths and weaknesses of any side chain replacement algorithm. Unfortunately, in contrast to the extensive databases detailing such properties as changes in the free energy of folding (Kumar et al. 2006), the number of mutants with solved structures is comparatively tiny. Nevertheless, a small set of experimentally mutated proteins was selected from the dataset of Sinha and Nussinov (2001). These mutants were all previously structurally characterized and differ from the wild-type protein only by one or several point mutations (no insertions or deletions), presenting a precise picture of the structural effects of side chain replacement alone. While modestly sized (34 structures in all), the data set covers a range of molecular functions including multiple structures of lysozyme, myoglobin, aspartic proteinase, dihydrofolate reductase, and RNAse A, and barnase. The dataset contains structures from each of the major Classes in SCOP. The full list of structures can be found in the Supplementary Material. Both SARA and SCWRL were used to predict the position of the mutated side chain(s) from the wild-type structure, and the predictions were compared with the known mutant structure. The distributions of differences between the SARA predictions and known structures in position (RSMD) and angle (Δθ− and Δθ+) are shown in Figs. 2 and 3c/d, respectively. These data confirm that the method is accurate in positioning mutated side chains: the median RMSD is only 0.56 Å, and 76% of the predicted side chains deviate less than 2.0 Å from their true coordinates. Angles of the predicted side chains are also consistent with those seen in the above experiments: the median difference is 11° for θ− (11° for θ+), yielding an overall accuracy of 76% (71% for θ+). It is interesting to note that although SCWRL is also quite accurate on this task (median RMSD 0.61 Å), confirming the use of its predictions as a benchmark in the single-residue replacement test, the SARA result is not significantly worse on this particular data set (P = 0.31, hypothesis testing as above). A closer look at the nature of the data explains why. Many of the replacements are mutations to C, S or T, all residues are handled unusually well by the SARA method.

Evolutionary Simulation

As an example, application of the method to high-throughput studies of the evolution of protein structure, the results of a largescale sequence simulation experiment are presented. The native sequence of a SH2 domain protein was evolved under the restriction of maintaining folding stability and ligand-binding function, using SARA to thread mutant sequences through a 2-bead representation of the protein structure. The effects of >1,000,000 point mutations that were evaluated in less than a week of processing would have taken over 5 months on an equivalent system using SCWRL and an all-atom model. The resulting distribution of sequences is shown as a SeqLogo (Crooks et al. 2004) in Fig. 4 (left), along with the HMM that characterizes the SH2 family of proteins in the Pfam database (Finn et al. 2009). Some important family-specific patterns are maintained: notably, small hydrophobic residues around position 6, 20, 30, 44, and 85 (1, 15, 25, 39, and 75 in Pfam). By mapping the 20 most slowly evolving positions onto the structure, one can observe that the spatial distribution of conserved positions is qualitatively the same for the simulated data and the SH2 family (Fig. 4, right). These residues correspond to the hydrophobic protein core and represent key interactions that hold the secondary structure elements together. Proper side chain packing is crucial to stability in the protein core (Dill et al. 2008), and the SARA method produces a well-packed solution even after numerous sequential single replacements. Strong conservation of residues adjoining the binding interface (around position 50) can also be observed, reflecting the selective pressure to avoid steric clashes with the ligand. Since the SH2 family of proteins bind varying ligands in vivo, this protein-specific pattern is not visible in the Pfam sequence collection.

Fig. 4
figure 4

Distribution of residue frequencies and information content for simulated and real sequences from the SH2 family. Matching parts of sequences are indicated by boxes. a Sequence logo for the final generation of 10 replicate populations of simulated sequences, and mapping of the 20 most conserved positions onto the structure of SH2 family member protein SAP (binding interface facing the viewer). The sequence logo shows the amount of Shannon information (Y axis) per position, with residue frequencies indicated by the size of each letter, as sampled across all individuals from the 10 replicate populations. The structure shows a white cartoon representation of the backbone, with conserved positions in black. b HMM emission probabilities for the sequence collection representing the SH2 family in the Pfam database, and mapping of the 20 positions that contribute most to the overall predictive probability of the domain onto the structure of SAP. Interpretation with respect to residue frequency and conservation is analogous to that of the sequence logo in “a”. Note that divergence within this family is substantially larger than that of the simulated sequences causing decreased information content per position (due to increased randomization over time). The seed tree from for SH2 domains in Pfam (Finn et al. 2009) is shown to depict how the sequence logo is generated

In conclusion, the first coarse-grained side chain replacement method applicable to real protein structures has been developed and tested. The method SARA is significantly faster than the current state-of-the-art algorithm (SCWRL) when performing single-residue replacements, and more predictable in its run time, but still retains an accuracy comparable with that of the most advanced fast all-atom method, when assessed in coarse-grained space. It is anticipated that the many fold increase in the speed of computation will facilitate high-throughput studies of protein structure evolution on the level of entire populations or genomes, as well as structurally constrained phylogenetic analysis. Furthermore, this method is expected to be highly useful in the emerging field of evolutionary molecular simulations, which requires exploration of the effects of millions of single mutations, along the lines shown with the SH2 domain. The full availability of the source code will allow users to integrate the method into their own experimental systems, and it will be an important component of the molecular evolutionary simulation system currently being developed.