11.1 Introduction

When the second protein crystal structure was solved (haemoglobin; [16]), it was already seen to resemble the first protein crystal structure (myoglobin; [7]), and the seeds of the molecular replacement method were sown. In the subsequent half-century it has become very clear that proteins with similar amino acid sequences have similar 3D structure, and for a long time molecular replacement has been an essential tool for the macromolecular crystallographer [22].

By now about two-thirds of protein structures are solved by molecular replacement [10] and, as the Protein Data Bank continues to expand, the method can only become more prominent. The rise in molecular replacement is also fuelled, in large part, by improvements in the algorithms, from model preparation through the molecular replacement search algorithms and on to the methods used to complete structures from poor starting models.

11.2 Model Preparation

To carry out molecular replacement, it is necessary to find a template (a related structure in the PDB) and then, possibly, to modify this template to be more similar to the target structure in the unknown crystal. Until a few years ago, the application of any molecular modelling protocol that changed the coordinates of the atoms tended to make the model worse for molecular replacement than the underlying template; in essence, there are many more ways to degrade the model than to improve it.

11.2.1 Model Trimming

One simple way to improve a template is to trim off the parts that are not expected to be conserved in the target, such as a domain or a large surface loop. At times it has been popular to trim back all the side chains to give a poly-Ala model, avoiding uncertainty about side-chain conformation; we find, in general, that this is too extreme and throws away useful signal.

Schwarzenbacher et al. [24] carried out a careful study of model trimming and drew two important conclusions. First, it is generally better to leave the side chains of conserved residues in the model, because their conformation is likely to be conserved as well, but to trim back non-identical side chains (and non-conserved surface loops). Even for non-identical residues, the first torsion angle is often conserved, so it is usually a good idea to keep the gamma atom of the residue. Second, as the sequence identity drops, it becomes essential to use the best possible sequence alignment, such as one obtained by profile-profile alignment methods, so that the right side chains and surface loops are actually modified.
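The trimming rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the protocol of Schwarzenbacher et al. [24] or of any particular program: the function names, the representation of atoms as (alignment column, atom name) pairs, the assumption that the template has no gaps in the alignment, and the special handling of Gly and Ala targets are all hypothetical choices made for the example.

```python
# Hypothetical sketch of alignment-guided side-chain trimming.
# Assumptions: one-letter sequences of equal length, '-' marks a deletion in
# the target, and the template has no gaps, so the alignment column doubles
# as the template residue index.
BACKBONE = {"N", "CA", "C", "O"}
GAMMA = {"CG", "CG1", "CG2", "OG", "OG1", "SG"}


def allowed_atoms(template_res, target_res):
    """Return the atom names to keep, or None to keep every atom."""
    if target_res == "-":
        return set()            # residue absent from the target: delete it
    if template_res == target_res:
        return None             # identical residue: keep the full side chain
    allowed = set(BACKBONE)
    if target_res != "G":
        allowed.add("CB")       # glycine has no beta carbon
    if target_res not in ("G", "A"):
        allowed |= GAMMA        # first torsion angle is often conserved
    return allowed


def trim_model(atoms, template_seq, target_seq):
    """Filter a list of (alignment_column, atom_name) pairs."""
    kept = []
    for column, atom_name in atoms:
        allowed = allowed_atoms(template_seq[column], target_seq[column])
        if allowed is None or atom_name in allowed:
            kept.append((column, atom_name))
    return kept
```

For example, with template `KLS` aligned to target `KM-`, the lysine is kept intact, the leucine is cut back to its gamma atom, and the serine is removed. The quality of the result depends entirely on the alignment that is passed in, which is the second point made above.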

Another form of model preparation is carried out in MOLREP [28]. Rather than simply deleting uncertain side chains, their B-factors can be inflated to reduce their influence on the calculation in a more subtle fashion [9].
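To see why inflating a B-factor acts as a softer form of deletion, recall that an atom's scattering contribution is attenuated by the standard Debye-Waller factor exp(-B sin²θ/λ²), which equals exp(-B/(4d²)) at resolution d. The short sketch below evaluates this factor; the function name and the example B-factors are illustrative assumptions and are not taken from MOLREP.

```python
import math


def b_factor_attenuation(b_factor, d_spacing):
    """Debye-Waller attenuation of an atom's scattering amplitude.

    b_factor is in square angstroms and d_spacing in angstroms;
    sin(theta)/lambda = 1/(2d), so the factor is exp(-B / (4 d^2)).
    """
    return math.exp(-b_factor / (4.0 * d_spacing ** 2))


# Illustrative values only: a well-ordered atom versus an inflated one.
for b in (20.0, 80.0):
    for d in (2.5, 5.0):
        print(f"B = {b:5.1f}  d = {d:3.1f} A  factor = {b_factor_attenuation(b, d):.3f}")
```

An atom with an inflated B-factor still contributes at low resolution, where its position is more likely to be approximately right, but it is strongly suppressed at high resolution, where an incorrect side-chain conformation would otherwise add noise.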

The program Sculptor [3] combines these approaches and allows a number of different model preparation protocols to be tested. Side chains and loops can be trimmed in different ways, and B-factors can be adjusted according to surface accessibility, local sequence conservation, or a combination of both. By carrying out a series of molecular replacement calculations with a number of different variations on the starting template, the overall success rate can be increased significantly.

11.2.2 Molecular Modelling

In recent years, the sophistication of molecular modelling algorithms has finally reached the point where the starting templates can be improved for molecular replacement. Impressive results have been obtained using the Rosetta modelling package to improve starting models derived from NMR experiments or from the crystal structures of homologues [17].

11.2.3 Ab Initio Modelling

In fact, even an ab initio model created by Rosetta in a blind structure prediction test was shown to be sufficiently accurate to be used successfully for molecular replacement [17]. The computational resources required to fold ab initio models of this level of accuracy are substantial, but it has subsequently been shown that, at least in favourable cases, ab initio folding methods making a more modest use of CPU time can also succeed [20].

11.2.4 Ensembles

As sequence identity drops, structures become less similar and the success rate of molecular replacement also drops. However, more candidate models are often available at lower levels of sequence identity. By collecting these into an ensemble, in which the conserved features are enhanced and the variable features are downweighted, the success rate can again be boosted. The likelihood framework, discussed below, allows a statistical weighting of the contributions of members of an ensemble, which can be helpful [18].

The success rate can also be enhanced by trimming off surface loops that are not conserved among members of the ensemble, leaving a conserved core. This was essential, for instance, in solving the structure of angiotensinogen using a collection of models with about 20 % sequence identity (Fig. 11.1; [29]). An automated trimming option has been implemented in the Ensembler program (Bunkóczi and Read unpublished), along with a robust multiple-superposition method that optimises the superposition of the conserved core.

Fig. 11.1 Solving the structure of angiotensinogen [29] with a trimmed ensemble. (a) Individual structures of heparin cofactor II, α1-antitrypsin and thyroxine-binding globulin. (b) Ensemble of superimposed structures. (c) Trimmed ensemble. Only the molecular replacement search with the trimmed ensemble gave a clear solution.
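The idea of trimming an ensemble back to its conserved core can be illustrated with a minimal sketch. It is not the algorithm used in Ensembler: it assumes that the members have already been superimposed and aligned residue by residue, it looks only at Cα positions, and both the function name and the 3 Å cutoff are hypothetical.

```python
import math


def conserved_core(members, max_deviation=3.0):
    """Return the aligned positions that belong to the conserved core.

    members: one list per ensemble member, all of the same length, holding
    a superimposed C-alpha coordinate (x, y, z) per aligned position, or
    None where that member has a gap.
    A position is kept only if every member has a residue there and no
    member lies further than max_deviation (angstroms) from the centroid.
    """
    core = []
    for position in range(len(members[0])):
        coords = [member[position] for member in members]
        if any(c is None for c in coords):
            continue  # loop present in only some members: trim it
        centroid = tuple(sum(axis) / len(coords) for axis in zip(*coords))
        spread = max(math.dist(c, centroid) for c in coords)
        if spread <= max_deviation:
            core.append(position)
    return core
```

Positions where the members disagree, or where some members have insertions, are dropped from every member, so that what remains is the part of the fold on which the ensemble agrees.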

11.3 Molecular Replacement Calculations

In principle, molecular replacement is a 6n-dimensional search to find the orientations and positions of n models, but such a large space is impractical to search exhaustively. One approach is to use stochastic methods such as genetic algorithms (EPMR; [8]) or Monte Carlo (QoS; [5]) to search in the full space. However, most molecular replacement programs, such as our program Phaser, break the problem down into a series of 3D searches with rotation functions to find the orientation of a molecule and translation functions to find its position. For problems where the model is sufficiently accurate to yield a useful map, the signal in the individual searches is usually strong enough that the correct solution at each step is found in a relatively short list of plausible partial solutions. This enables a tree-search-with-pruning strategy [14].
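A rough count shows why the divide-and-conquer approach pays off. The sketch below simply counts score evaluations for one molecule; the grid sizes and the length of the shortlist are invented for illustration, and the count ignores the fact that fast rotation and translation functions evaluate many grid points at once.

```python
def exhaustive_search_size(n_orientations, n_positions, n_models=1):
    """Evaluations needed to try every orientation/position combination
    for n_models molecules simultaneously (a 6n-dimensional search)."""
    return (n_orientations * n_positions) ** n_models


def split_search_size(n_orientations, n_positions, n_kept):
    """Evaluations for one molecule when a rotation search is followed by
    a translation search for each of the n_kept best orientations."""
    return n_orientations + n_kept * n_positions


# Hypothetical grid sizes, chosen only to show the orders of magnitude.
n_rot, n_trans, shortlist = 50_000, 200_000, 20
print(exhaustive_search_size(n_rot, n_trans))        # full 6D search
print(split_search_size(n_rot, n_trans, shortlist))  # rotation, then translation
```

With these assumed numbers the six-dimensional search needs 10^10 evaluations for a single molecule, against about 4 × 10^6 for the split search, and the gap grows exponentially with the number of molecules. The price is that the correct orientation must survive in the shortlist, which is why the strength of the signal at each step matters.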

11.3.1 Likelihood

Traditional molecular replacement calculations were based on the properties of the Patterson map, but the use of likelihood scores has a number of advantages [18]: the influence of data at different resolutions is weighted sensibly based on the expected quality of the model, information from partial models can be taken into account, and the likelihood score can be used robustly to rank different potential solutions, which is useful for automation strategies.

The molecular replacement likelihood functions [18] are relatively expensive to compute but, fortunately, it is possible to derive good approximations that can be computed efficiently. Likelihood-based fast rotation [25] and fast translation [13] functions can be used to generate a short list of plausible solutions, which can then be ranked using the full likelihood score.

The idea of likelihood is simple: models or hypotheses can be tested by how well they agree with the measured data. Likelihood gives a probabilistic measure of agreement with the data, i.e. likelihood measures the probability that the set of data would have been measured, given the model and any associated uncertainties in the model parameters or the data. A more in-depth understanding can be obtained from the review on likelihood in crystallography by McCoy [11].
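As a toy illustration of ranking hypotheses by likelihood, the sketch below scores two sets of predicted values against the same measurements. It assumes independent Gaussian measurement errors, which is a stand-in chosen for simplicity and is not the form of the molecular replacement likelihood targets [18]; all names and numbers are hypothetical.

```python
import math


def gaussian_log_likelihood(observed, predicted, sigmas):
    """Log of the probability density of the observations, given the
    predictions of a hypothesis and Gaussian errors with the given sigmas."""
    total = 0.0
    for obs, pred, sigma in zip(observed, predicted, sigmas):
        total += -0.5 * math.log(2.0 * math.pi * sigma ** 2)
        total += -((obs - pred) ** 2) / (2.0 * sigma ** 2)
    return total


# Invented measurements and two competing hypotheses.
observed = [10.0, 12.0, 9.0]
sigmas = [1.0, 1.0, 1.0]
hypotheses = {
    "A": [10.5, 11.5, 9.0],
    "B": [13.0, 9.0, 12.0],
}
ranked = sorted(
    hypotheses,
    key=lambda name: gaussian_log_likelihood(observed, hypotheses[name], sigmas),
    reverse=True,
)
print(ranked)
```

Working with the logarithm turns the product of probabilities into a sum, and the uncertainties enter explicitly, so that poorly determined data are automatically given less weight.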

11.3.2 Automation

A molecular replacement calculation can be thought of as testing a series of hypotheses about the orientation and then the position taken by molecules in the crystal. Since likelihood is an effective measure to rank hypotheses, it lends itself to decision-making in an automated molecular replacement strategy. As noted above, Phaser uses a tree-search-with-pruning strategy. Heuristic rules (e.g. the correct solution is usually above 75 % of the distance between the mean and the top in any step of the search) are used to keep a list of plausible solutions and discard the less plausible ones. Multiple alternative models for a component can be evaluated at the same time, and the best one can be chosen by its likelihood score. Even different possible choices of space group can be evaluated. If the crystal contains a complex of different components, then the search order for the different components can be evaluated by considering how well each component would explain the data.
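The pruning heuristic mentioned above can be written down directly. In this sketch the cutoff is placed at a given fraction of the distance between the mean score and the top score, and everything below it is discarded; the function name, the use of the mean over the supplied candidates, and the example scores are assumptions made for illustration rather than a description of Phaser's implementation.

```python
def prune_solutions(scored, fraction=0.75):
    """Keep candidates scoring at least `fraction` of the way from the
    mean score to the top score.

    scored: list of (label, score) pairs from one step of the search.
    """
    scores = [score for _, score in scored]
    mean = sum(scores) / len(scores)
    top = max(scores)
    cutoff = mean + fraction * (top - mean)
    return [(label, score) for label, score in scored if score >= cutoff]


# Invented log-likelihood-gain scores from a rotation search.
candidates = [("r1", 100.0), ("r2", 90.0), ("r3", 40.0),
              ("r4", 30.0), ("r5", 20.0), ("r6", 20.0)]
print(prune_solutions(candidates))
```

Here the mean is 50 and the top score is 100, so the cutoff is 87.5 and only the first two orientations are passed on to the translation search. Each surviving candidate becomes a branch of the tree, and the same rule is applied again at the next step.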

Increasingly, molecular replacement is being implemented as part of a pipeline, such as MrBUMP [6], BALBES [10] and AutoMR in the Phenix package [1]. Ideally, such pipelines are started by supplying only the diffraction data and the sequences of the proteins in the crystal, and then they fetch the template structures, modify them, carry out molecular replacement, and even follow that with automated building and refinement.

11.3.3 Pathologies

Experience has shown that likelihood targets are more sensitive than the traditional Patterson-based methods in finding the solution. However, this sensitivity is a double-edged sword, because likelihood is also more sensitive to errors in the assumptions used to derive the likelihood targets. One such assumption is that the crystal diffracts isotropically (i.e. equally strongly in all directions in reciprocal space). Likelihood-based molecular replacement is severely degraded by the effects of anisotropic diffraction, unless a correction is applied. Fortunately, likelihood also provides the tools to characterise the anisotropy and correct for its effects [14], and anisotropic diffraction no longer presents a problem.

Similarly, the presence of translational non-crystallographic symmetry (tNCS) also severely violates the assumptions of the original likelihood targets. In tNCS, two or more copies of the molecule are found in the same orientation in the crystal. Depending on their relative position, and how this relates to the Bragg planes for a particular reflection, they can scatter in phase (leading to exceptionally strong reflections) or out of phase (leading to exceptionally weak reflections). Until recently, the presence of tNCS was one of the leading causes of failure in Phaser for cases that would otherwise be expected to succeed. Methods to characterise tNCS and account for its statistical effects on the diffraction pattern have now been implemented in Phaser, dramatically increasing success rates in these cases (McCoy and Read unpublished).
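The modulation caused by tNCS is easy to work out for the idealised case of two identical copies in exactly the same orientation, related by a translation t (in fractional coordinates). The structure factor of the pair is F(h)[1 + exp(2πi h·t)], so the intensity is multiplied by |1 + exp(2πi h·t)|² = 2 + 2cos(2π h·t). The sketch below evaluates this factor; it ignores any difference in orientation or conformation between the copies, which the real statistical treatment has to allow for, and the function name is hypothetical.

```python
import math


def tncs_intensity_factor(hkl, translation):
    """Intensity modulation for two identical, identically oriented copies
    related by a fractional translation vector."""
    phase = 2.0 * math.pi * sum(h * t for h, t in zip(hkl, translation))
    return 2.0 + 2.0 * math.cos(phase)


# Example: a pseudo-translation of half a cell edge along a.
t = (0.5, 0.0, 0.0)
for hkl in [(1, 0, 0), (2, 0, 0), (3, 1, 2), (4, 1, 2)]:
    print(hkl, round(tncs_intensity_factor(hkl, t), 3))
```

For this translation, reflections with odd h are extinguished and those with even h are four times as strong as for a single copy, instead of the factor of two expected on average for two independently placed copies. Intensity statistics that assume a single smooth distribution are therefore badly wrong unless the modulation is modelled.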

11.4 Model Completion

When the available models are poor (typically low sequence identity) or incomplete, or the resolution of the data is limited, it has frequently been found that the molecular replacement problem can be solved but the electron density maps are too poor to see what needs to be done to complete the structure. Fortunately, a number of recent developments have markedly improved this situation.

11.4.1 Morphing and Other Smooth Deformations

Looking at distant homologues, one often sees that the basic fold is preserved, but the relative positions and orientations of structural elements have changed slightly. Even though such movements might be difficult to see in a density map at the local level, there are weak signals that can be combined over a larger region. Tom Terwilliger (personal communication) has developed a “morphing” algorithm that takes advantage of these signals. It looks for rigid-body movements that would improve the fit to density of a window of residues along the chain, and then applies that shift to the central residue in the window. By sliding the window along the chain, a smooth transformation (“morphing”) of the model is achieved. In a number of test cases, this has led to sufficient improvement in the model, and thus the phases, that further improvements to the model become clear in the density.
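The sliding-window idea can be sketched as follows. This is a deliberately reduced illustration and not Terwilliger's algorithm: it uses translations only rather than full rigid-body movements, and it assumes that a noisy estimate of the locally preferred shift is already available for each residue instead of deriving it from the density. The function and parameter names are hypothetical.

```python
def morph(coords, local_shifts, half_window=2):
    """Apply a smoothly varying shift along the chain.

    coords:        one (x, y, z) position per residue, in chain order.
    local_shifts:  one noisy (dx, dy, dz) estimate per residue of the shift
                   that would improve the local fit (assumed to be given).
    For each residue, the shifts in a window centred on it are averaged
    and the averaged shift is applied to that central residue only.
    """
    n = len(coords)
    morphed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = local_shifts[lo:hi]
        mean_shift = tuple(sum(axis) / len(window) for axis in zip(*window))
        morphed.append(tuple(c + s for c, s in zip(coords[i], mean_shift)))
    return morphed
```

Because neighbouring windows overlap almost completely, adjacent residues receive nearly the same shift, so the chain is distorted smoothly instead of being broken into independently moving pieces, while weak local signals are pooled over the window.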

Refinement methods that lead to smooth deformations, such as the jelly-body method [15] or DEN refinement [23], are also very helpful in the initial stages of refinement from a poor molecular replacement model. This is illustrated clearly in a test case using DEN refinement to complete a structure that had been stuck in refinement [2].

11.4.2 Rosetta Modelling

In particularly difficult cases, the largest convergence radius in rebuilding and refining from a poor model is probably achieved by using the advanced modelling algorithms in Rosetta [4], combining the Rosetta energy functions with electron density fit scores to build into noisy density maps. The phenix.mr_rosetta pipeline [27] provides a convenient interface giving access to Rosetta modelling, molecular replacement in Phaser, and automated building and refinement in AutoBuild [26].

11.4.3 Arcimboldo

Completing the structure starting from a highly incomplete model presents similar challenges to starting from a poor but relatively complete model. The Arcimboldo procedure [21] is discussed elsewhere in greater detail by Isabel Usón. Briefly, this exploits the power of density modification and automated building algorithms to extend incomplete models comprising only a few helices, placed using Phaser.

11.5 Combined Methods

11.5.1 MR-SAD

A molecular replacement model can be used as a starting point for the computation of log-likelihood-gradient (LLG) maps to find anomalous scatterers using single-wavelength anomalous diffraction (SAD; [12, 19]). In some cases, the anomalous signal may be too weak to find the anomalous scatterers with ab initio substructure determination methods, but nonetheless significant phase information can be obtained once the sites have been found using SAD LLG maps, even if those are based on a poor molecular replacement model. In other cases, locating anomalous scatterers in a refined model can be a valuable tool for identifying unknown components, such as bound ions.

11.5.2 Using Density as a Model

Proteins frequently crystallise in multiple crystal forms and, at times, experimental phase information can only be obtained for one of these forms. In such cases, the electron density can be cut out of one map and used as a molecular replacement model to solve another crystal form.

Such a procedure was used in solving the structure of angiotensinogen [29]. A poor electron density map was available for crystals of the human form of this protein, combining information from molecular replacement with an ensemble of distant models at 3.3 Å resolution with SAD phases from a GdCl₃ derivative at 4 Å resolution. Molecular replacement with the same ensemble model did not succeed in solving the structures of crystals from rat or mouse angiotensinogen, but electron density extracted from the map of the human form did give a clear solution for two copies of angiotensinogen in one of the rat crystal forms. In turn, averaged density from this rat crystal form could be used to find two copies in the second rat crystal form, allowing 4-fold multi-crystal averaging to be initiated between the two rat crystal forms.

Molecular replacement serves two purposes for multi-crystal averaging, in such cases: it defines the rotation and translation operators that superimpose the density in one crystal on the density in the other crystal, and it provides initial phases for the second crystal form.

11.6 Future Developments

There has been rapid progress in recent years in the power and reach of molecular replacement, and there are good reasons to believe that this will continue. As density modification and model-building algorithms improve, it will become possible to solve structures from even less complete and less accurate starting points. Improvements in our understanding of the likelihood targets will feed into better automation strategies, both by allowing us to predict how good the model must be to have a chance of success, and by providing measures of confidence in partial solutions obtained along the solution path. Even if there were no improvements in the algorithms, the continued rapid growth of the PDB would ensure that there are good models for an ever-expanding set of targets.