We are like dwarfs sitting on the shoulders of giants. We see more, and things that are more distant, than they did, not because our sight is superior or because we are taller than they, but because they raise us up, and by their great stature add to ours … This sentence was written (in Latin) in the logic treatise Metalogicon by John of Salisbury in 1159. Salisbury attributed this sentence to Bernard of Chartres. It was reused later by Isaac Newton to explain the development of western science.

1 Introduction

The goal of this chapter is to give an overview of protein structure determination by nuclear magnetic resonance (NMR). We give a historical perspective that illustrates the necessity of solving distance geometry problems in order to determine protein structures. Briefly, the problem consists of exploiting experimental information obtained from NMR experiments, which mainly concerns distances between hydrogen atoms, in order to find the three-dimensional structure of a protein. Together with the NMR information, we can also use additional information deduced from the knowledge accumulated during the twentieth century on molecular structures. For this reason, the sentence by John of Salisbury quoted above applies perfectly to protein structure determination.

This chapter is organized as follows. In Sect. 18.2, we briefly introduce protein structures. In Sect. 18.3, we describe the conformational space of protein structures, while we discuss molecular dynamics in Sect. 18.4. In Sect. 18.5 we briefly describe NMR experiments, and Sect. 18.6 is devoted to the problem of deriving atomic distance restraints from NMR data. Section 18.7 is devoted to some pseudo-potentials that can be used for modeling the distance restraints, while the distance geometry problem with NMR data is discussed in Sect. 18.8, where the first implemented computational method for protein structure calculation from NMR data is presented. Nowadays, the most widely used method for solving distance geometry problems with NMR data is the meta-heuristic simulated annealing (SA): we present two variants of this algorithm in Sect. 18.9, one based on the Cartesian representation of protein structures and the other based on the torsion angle representation. We conclude the chapter in Sect. 18.10, where we discuss some future demands for protein structure determination.

2 Introduction to Protein Structure

The determination of high-resolution protein structures is essential for the understanding of complex biological mechanisms, for the development of biotechnological methods, drug design, and many other applications. Requiring the protein structure to have a high resolution implies that the position of each of its atoms is identified precisely (uncertainty smaller than 1 Å).

Proteins are polymeric chains in which the units are the 20 natural L-α-amino acids, connected by peptide bonds. Several structures of dipeptides, which were solved by X-ray crystallography in the early 1930s by the group led by Linus Pauling, demonstrated that the peptide bond can have two configurations: cis and trans [34, 57]. The trans configuration has lower energy and is the most abundant configuration in proteins. There are, however, exceptions, such as cis-prolines, which are important for thioredoxin activity [13, 29, 30]. The peptide bond is planar because of the resonance effect, which gives it a double-bond character. Figure 18.1 illustrates a polypeptide chain and the planar character of the peptide bond.

Fig. 18.1

Illustration of a peptide bond planar structure: (a) the planar peptide bond (dotted rectangle); the dihedral angles that give torsion freedom for the peptide backbone (Φ and Ψ angles) are indicated by arrows; the side chains are represented by R 1 and R 2; (b) resonance forms of the peptide bond and its double-bond character

Several other geometrical properties of proteins were defined before the first protein structure was solved by Kauzmann in 1964 [34]. Perhaps the most important property is the presence of secondary structure elements, such as α-helices, which were first proposed by Linus Pauling [56–59].

The amino acid sequence is also called the primary structure. An amino acid included in a polypeptide chain is called an amino acid residue. Secondary structures represent local structural organizations that are stabilized by hydrogen bonds in the main chain. They can be observed in several proteins. The main polypeptide chain, also called the protein backbone, is the protein sequence without the radicals of the L-α-amino acids, i.e., without side chains. The backbone contains only one hydrogen donor for a hydrogen bond, the amidic hydrogen (N–H), and only one electron pair that serves as hydrogen acceptor in a hydrogen bond, the free electron pair of the carbonyl (CO). Recall that the electron pair of the amide nitrogen on the main chain is “busy” because it is part of the double bond related to one of the resonance forms of the peptide bond (see Fig. 18.1). This means that, in proteins, the only hydrogen bonds stabilizing secondary structures are the ones between the amidic N–H and the carbonyl.

The protein backbone needs to bend in order to stabilize secondary structures, because this is needed for forming hydrogen bonds between amino acids. There are two degrees of freedom that lead to the bending of the main chain. These degrees of freedom are defined by the dihedral angles Φ and Ψ. The dihedral angle Φ among the atoms C(i − 1), N, Cα, and C defines the torsion of the bond N–Cα. The dihedral angle Ψ among N, Cα, C, and N(i + 1) defines the torsion of the bond Cα–C (where C is the carbonyl carbon). It is important to note that the two-dimensional plot of Φ versus Ψ, known as the Ramachandran plot, can describe the folding of a protein in the sense that a particular pair (Φ,Ψ) can be identified for each amino acid residue forming the protein. It is also remarkable that the Ramachandran plot defines the whole conformational space for the backbone structure of a protein [61]. Note that the torsion of the peptide bond is not considered as a degree of freedom because of its planarity.
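The definition of Φ and Ψ as dihedral angles among four atoms can be illustrated with a minimal sketch. It assumes that atom positions are available as (x, y, z) tuples and uses the common sign convention in which a cis arrangement gives 0° and a trans arrangement gives ±180°; the function name `dihedral_deg` and the helper functions are hypothetical and do not come from any particular package.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) among four points.

    For Phi the points would be C(i-1), N, CA, C;
    for Psi they would be N, CA, C, N(i+1).
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)          # central bond, the axis of rotation
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)        # normal of the plane (p0, p1, p2)
    n2 = _cross(b1, b2)        # normal of the plane (p1, p2, p3)
    b1_len = math.sqrt(_dot(b1, b1))
    b1_hat = tuple(c / b1_len for c in b1)
    x = _dot(n1, n2)
    y = _dot(b1_hat, _cross(n1, n2))
    return math.degrees(math.atan2(y, x))
```

Applying such a function to every residue of a structure would give the list of (Φ, Ψ) pairs that are displayed in a Ramachandran plot.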

In the 1960s, Ramachandran performed calculations on small peptides and showed that not all combinations of Φ and Ψ are possible in proteins. Moreover, there are high-energy conformations that can be considered as forbidden [61]. On the other hand, the Φ and Ψ combinations that are observed in secondary structures define the lowest-energy conformations. High-energy states are due to steric effects between large side chains. We can say that, for a given amino acid residue, the larger the side chain, the smaller the low-energy areas in the Ramachandran plot. Figure 18.2 shows the Ramachandran plot of yeast thioredoxin 1 (PDB id:2I9H, [2]) and the location of the main secondary structures.

Fig. 18.2

Ramachandran plot. Leftmost plot: Φ and Ψ angles for 500 high-resolution crystal structures selected from PDB: the plot shows a general dihedral freedom adopted in proteins [48]. Plot in the center: superposition of the angles Φ and Ψ extracted from the 20 lower-energy structures of thioredoxin 1 (PDB id:2I9H) solved by solution NMR [60]. Rightmost plot: contour plot showing allowed and generously allowed regions for the angles Φ and Ψ calculated from the initial data set and the superposition of dihedral angles from 2I9H structures

The two most frequent secondary structure elements are α-helices and β-sheets (parallel and antiparallel). It is not within the scope of this chapter to describe all secondary structure elements, but rather to place them in the context of the protein structure determination problem. These two secondary structure elements define the lowest-energy regions of the Ramachandran plot: the α-helix region is near \(\Phi = -60^{\circ}\) and \(\Psi = -30^{\circ}\), and the β-sheet region is near \(\Phi = -120^{\circ}\) and \(\Psi = 135^{\circ}\) (see Fig. 18.2).

3 The Problem of Conformational Space

As we have seen in the previous section, the Ramachandran plot is related to the backbone conformation of a protein. If one knows the dihedral angles Φ and Ψ for all residues, not only the secondary structure is determined, but also the tertiary structure can be derived from this information. The tertiary structure is given by all three-dimensional coordinates of the atoms forming the protein. The quaternary structure defines the structural organization of proteins in oligomers. The oligomerization of a protein can be homo-oligomerization, where the association involves the same amino acid chain, or hetero-oligomerization, where the association occurs with different chains. There are several levels of oligomerization: dimers, decamers/dodecamers, and virus structures, which may contain thousands of chains.

It is important to discuss the forces that stabilize the tertiary and the quaternary structures of proteins. They are mainly represented by intermolecular non-covalent interactions between atoms belonging to the protein backbone or to the side chains of the amino acids. These interactions are generally called tertiary contacts. We give, in the following, some details about the main interaction forces in proteins.

3.1 Hydrogen Bonds

Besides the hydrogen bonding between the amidic group N–H and the carbonyl group (CO) of the backbone that stabilizes the secondary structures, there are many amino acid side chains that can form hydrogen bonds. The amino acid residues serine, threonine, and tyrosine contain a hydroxyl group (–OH) that can be either donor or acceptor of a hydrogen in a hydrogen bond. Moreover, aspartate and glutamate are carboxylic acids that contain a hydroxyl and a carbonyl (donor and acceptor). Asparagine and glutamine contain an amide (–NH2, mainly donor) and a carbonyl (acceptor). Finally, lysine (amine) and arginine (guanidinium group) are also good donors and acceptors for hydrogen bonds.

In proteins, there is always competition between intramolecular and intermolecular hydrogen bonding (the protein solvent is water). The more a residue is exposed to water (near the protein surface), the smaller the contribution of its intramolecular hydrogen bonds to the stabilization of tertiary and quaternary structures.

3.2 Coulomb Interactions

Several side chains can ionize in water, gaining or losing protons and thereby becoming charged. At neutral pH, aspartate, glutamate, and the carboxy terminus are negatively charged, whereas lysine, arginine, histidine, and the amino terminus are positively charged. The proximity of two opposite charges leads to Coulombic interactions that, when present, strongly contribute to the stabilization of tertiary and quaternary structures.

Charged residues are solvated by water. The dipole of water neutralizes the charge. Coulombic interactions, also known as salt bridges, are restricted to protein microenvironments where water access is limited. Similar to the dependence of the strength of hydrogen bonds on water access, the more exposed the charged residue is to water, the weaker the intramolecular Coulombic interaction.

3.3 Van der Waals

van der Waals (vdW) interactions are dipole–dipole interactions that occur at very short distances (r < 5 Å). Although they are the weakest forces involved in the stabilization of protein structures, they are the most important for the tertiary and quaternary structures of proteins because of their high abundance. Every pair of dipoles that are close to each other contributes to the protein stabilization.

Water contributes favorably to VdW interactions because the apolar hydrophobic side chains tend to avoid exposure to the solvent, the so-called hydrophobic effect. In this way, they become part of a hydrophobic core. The exposure of hydrophobic side chains to bulk water leads to a high entropic penalty. The VdW force has two components: a repulsive one at very short distances (r < 1.8 Å), which decays proportionally to \(r^{-12}\), and an attractive one (1.8 < r < 5 Å), which decays proportionally to \(r^{-6}\).

VdW is the “glue” that sticks together the protein structure. Hydrophobic residues are packed in the protein and kept by VdW interactions.

The water contribution is also very important. The exposure of each residue to water determines what kind of interaction is more important for the structure stabilization. Polar residues tend to be found on the surface of proteins, and this is the reason why the contribution of polar interactions, such as hydrogen bonds and Coulomb interactions, needs to be weighted by the water access.

Water access is also essential for protein dynamics. Polar side chains on the surface of globular proteins have structures that fluctuate among several conformational states. On the other hand, polar side chains that are packed in the protein core strongly contribute to the stabilization of the protein structure and are subject to restricted motions and well-defined configurations. The limited access of water increases the interaction energy of intramolecular hydrogen bonds and salt bridges. Apolar side chains in the protein core are packed with restricted motions due to the VdW interactions.

Apolar side chains on the surface of a protein are exposed to water. Any exposure of apolar surface to water leads to entropic penalties due to the super-organization of the water molecules. In order to avoid the entropic penalty, the protein tends to find an alternate organization where the apolar surface is hidden from water. Proteins that contain hydrophobic patches are less soluble in water and/or tend to oligomerize or interact with other proteins.

4 Protein Geometry and Introduction to Molecular Dynamics Simulation

It is not our goal to describe all the details of protein geometry and molecular dynamics simulation, but rather to emphasize some aspects that are important in the context of protein structure calculation.

Nowadays, we have the possibility of considering the knowledge accumulated over the last century on molecular structures, and particularly the knowledge about protein structure. Force fields are generally based on simplified versions of the classical mechanical equations that can be defined for each geometry element in the molecule and by each interaction force. The creation, and the continuous improvement, of the force fields enables the simulation of the protein geometry, of the intra- and inter-molecular interactions, and of the protein dynamics.

Simulations of molecular dynamics can be performed by solving Newton’s equation in discrete time steps (the integration time step). The time step must be small enough to resolve the fastest dynamic events of the polypeptide, such as bond vibrations. Typically, the time step is smaller than 5 femtoseconds (5 fs, that is, \(5 \times 10^{-15}\) s). The mass of each atom, the equilibrium distances, and the equilibrium angles are parameterized in the available force fields [9, 32, 65].

To compute the trajectory at each integration time (dt), the motion equations are obtained using Newton’s second law (\(\vec{F} = m\vec{a}\)). The resulting external forces can be written as the gradient of the potential energy:

$$\vec{F} = -\nabla V.$$
(18.1)

The gradient (∇) is a vector operator that, when applied to a scalar function such as V (x, y, z), yields a vector:

$$\nabla V (x,y,z) = \frac{\partial V } {\partial x} \vec{{e}_{x}} + \frac{\partial V } {\partial y} \vec{{e}_{y}} + \frac{\partial V } {\partial z} \vec{{e}_{z}}.$$

The combination of the equations above results in a differential equation that is integrated at each time step in order to obtain the trajectory of motion:

$$\nabla V = -m\frac{{\mathrm{d}}^{2}\vec{r}} {\mathrm{d}{t}^{2}} .$$
(18.2)
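The stepwise integration of Eq. (18.2) can be sketched in a few lines. The text does not name a specific integrator; the sketch below assumes the velocity Verlet scheme and a hypothetical one-dimensional harmonic potential \(V (x) = K{(x - {x}_{0})}^{2}\) in arbitrary units, so that the names `velocity_verlet`, `harmonic_force`, `K`, `X0`, and `MASS` are illustrative only.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate m * d2x/dt2 = F(x) = -dV/dx in discrete time steps dt."""
    a = force(x) / mass
    trajectory = [x]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt      # advance the position
        a_new = force(x) / mass                 # force at the new position
        v = v + 0.5 * (a + a_new) * dt          # advance the velocity
        a = a_new
        trajectory.append(x)
    return trajectory, v

# Hypothetical 1D example in arbitrary units: V(x) = K * (x - X0)**2
K, X0, MASS = 1.0, 0.0, 1.0

def harmonic_force(x):
    # F = -dV/dx for the harmonic potential above
    return -2.0 * K * (x - X0)

traj, v_final = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                                mass=MASS, dt=0.01, n_steps=1000)
```

With a sufficiently small time step, the total energy (kinetic plus potential) stays close to its initial value along the trajectory, which is a simple check that the time step is adequate for the fastest motion in the system.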

Note that the vector force \(\vec{F}\) is obtained for a potential field V (x, y, z). This is the reason why the set of parameters is also called “force field”. The protein structure geometry is defined in the force field by the bond lengths, the bond angles, and by the proper and improper dihedral angles. The nonbonded intramolecular interaction is defined by the nonbonded potential, which mainly considers Coulomb and VdW interactions:

$${V }_{\mathrm{total}} = {V }_{\mathrm{bonds}} + {V }_{\mathrm{angles}} + {V }_{\mathrm{dihedrals}} + {V }_{\mathrm{impropers}} + {V }_{\mathrm{nonbonded}}.$$

The intermolecular interactions with the solvent are also defined by nonbonded terms. The bond and angle potentials are harmonic potentials that model the vibration motion according to Hooke’s law:

$$\begin{array}{rcl} {V }_{\mathrm{bonds}}& =& \sum\limits_{\mathrm{bonds}}{K}_{\mathrm{b}}{(r - {r}_{0})}^{2}, \\ {V }_{\mathrm{angles}}& =& \sum\limits_{\mathrm{angles}}{K}_{\theta }{(\theta- {\theta }_{0})}^{2}, \\ \end{array}$$

where \(K_{\mathrm{b}}\) and \(K_{\theta }\) are spring constants for bonds and angles, respectively; r is the generic bond length, while \(r_{0}\) is the bond length at equilibrium. Similarly, θ is the generic bond angle, whereas \(\theta_{0}\) is the bond angle at equilibrium.

A proper dihedral defines a torsion angle formed by four atoms joined contiguously through bonds. It defines the geometry of the real dihedrals of the protein. Improper dihedrals define the planarity of aromatic rings and peptide bonds, and they prevent stereocenters from interconverting. They also express torsion angles formed by atoms that are not necessarily connected through bonds. Proper dihedrals are usually expressed as periodic potentials:

$$\begin{array}{rcl} {V }_{\mathrm{dihedrals}}& =& \sum\limits_{\mathrm{dihedrals}}{K}_{\omega }\left [1 +\cos (n\omega- \gamma )\right ], \\ {V }_{\mathrm{impropers}}& =& \sum\limits_{\mathrm{impropers}}\frac{1} {2}{K}_{\xi }{\left [{\xi }_{ijkl} - {\xi }_{0}\right ]}^{2}.\end{array}$$

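\(K_{\omega }\) and \(K_{\xi }\) are force constants, ω is the proper dihedral angle, n is the multiplicity (periodicity) of the potential, and γ is the phase of the periodic potential. \(\xi_{ijkl}\) is the generic improper dihedral angle and \(\xi_{0}\) is the improper dihedral angle at equilibrium.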
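The bonded terms can be written as small functions, which makes their behavior easy to inspect. This is a minimal sketch with hypothetical function names and arbitrary units; the improper term is taken here as harmonic (squared deviation from equilibrium), which is the usual convention and an assumption of this sketch, and the parameter values used with these functions would come from a force field file rather than from this example.

```python
import math

def harmonic(k, value, equilibrium):
    """Hooke-type term K * (value - equilibrium)^2, used for bonds and angles."""
    return k * (value - equilibrium) ** 2

def proper_dihedral(k_omega, n, omega, gamma):
    """Periodic term K_omega * [1 + cos(n * omega - gamma)], angles in radians."""
    return k_omega * (1.0 + math.cos(n * omega - gamma))

def improper_dihedral(k_xi, xi, xi0):
    """Harmonic term 0.5 * K_xi * (xi - xi0)^2, angles in radians."""
    return 0.5 * k_xi * (xi - xi0) ** 2

def bonded_energy(bonds, angles, dihedrals, impropers):
    """Sum the four bonded contributions.

    bonds and angles: lists of (K, value, equilibrium)
    dihedrals:        list of (K_omega, n, omega, gamma)
    impropers:        list of (K_xi, xi, xi0)
    """
    total = sum(harmonic(*b) for b in bonds)
    total += sum(harmonic(*a) for a in angles)
    total += sum(proper_dihedral(*d) for d in dihedrals)
    total += sum(improper_dihedral(*i) for i in impropers)
    return total
```

A periodic term with n = 3 and γ = 0, for example, has three equivalent minima per full rotation, whereas the harmonic terms have a single minimum at the equilibrium value.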

The nonbonded potentials are defined as follows (the first term represents the Coulomb forces, while the second one represents the VdW forces):

$${V }_{\mathrm{nonbonded}} =\sum\limits_{i,j\ \mathrm{pairs}}\frac{{q}_{i}{q}_{j}} {\epsilon {r}_{ij}} +\sum\limits_{i,j\ \mathrm{pairs}}\left ( \frac{{A}_{ij}} {{r}_{ij}^{12}} - \frac{{B}_{ij}} {{r}_{ij}^{6}}\right ),$$

where \(q_{i}\) and \(q_{j}\) are the charges of the atoms i and j, ε is the electrical permittivity constant, \(r_{ij}\) is the distance between the two atoms i and j, and \(A_{ij}\) and \(B_{ij}\) are two constants related to the Lennard-Jones potential, modeling the VdW forces. We remark that other potentials, modeling, for example, the hydrogen bonds, can also be defined in force fields.
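The nonbonded sum can be sketched as a double loop over atom pairs. The sketch below is illustrative only: the function names are hypothetical, units are arbitrary, every pair i < j is included, and the exclusion of directly bonded pairs that force fields normally apply is ignored for simplicity.

```python
import math

def nonbonded_energy(coords, charges, a_coeff, b_coeff, epsilon=1.0):
    """Return (Coulomb, Lennard-Jones) energies summed over all pairs i < j.

    coords:  list of (x, y, z) tuples
    charges: list of atomic charges q_i
    a_coeff, b_coeff: square tables with the A_ij and B_ij constants
    """
    coulomb = 0.0
    vdw = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            coulomb += charges[i] * charges[j] / (epsilon * r)
            vdw += a_coeff[i][j] / r ** 12 - b_coeff[i][j] / r ** 6
    return coulomb, vdw

def lj_minimum_distance(a_ij, b_ij):
    """Distance at which A/r^12 - B/r^6 is minimal: r_min = (2A/B)^(1/6)."""
    return (2.0 * a_ij / b_ij) ** (1.0 / 6.0)
```

The second function follows from setting the derivative of the Lennard-Jones term to zero; it gives the separation at which the repulsive \(r^{-12}\) and attractive \(r^{-6}\) contributions balance.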

Tables 18.1 and 18.2 show, as an example, the topology and the force field implemented in XPLOR-NIH and CNS.

Table 18.1 Selected parts of the topology table used by XPLOR-NIH and CNS. We consider the topology of serine and report the atom types, charges, bond descriptions, and atoms involved in the proper and improper dihedral angle definitions. Note that the improper torsion angles define the chirality and stereoisomerism of the amino acid.
Table 18.2 Selected parts of the PARALLHDG force field (parallhdg5.1.param) [45]

Note that the topology of each amino acid (we consider serine in the tables) is defined by the atomic weight, by the charge, and by the covalent connections of each atom. The force field is defined by the parameters (bond, angle, proper and improper dihedrals, and nonbonded interactions) that enable the calculation of all the potentials listed above.

In the next section, we briefly describe conceptual aspects of NMR that help in understanding how to use NMR experimental data and convert them into distance restraints. NMR and molecular dynamics simulation, along with other computational methods, can be considered good partners, in the sense that they are complementary. NMR experiments provide essential structural and dynamical information for the parameterization and improvement of the computational methods, while the computational methods provide a unique way to interpret the experimental data.

5 Introduction to Nuclear Magnetic Resonance

NMR is a spectroscopic technique that deals with the nuclear spin and its interaction with a magnetic field. Several nuclei are magnetically active, in the sense that they have an associated magnetic moment. Among the magnetically active nuclei, 1H, 13C, and 15N are the most important probes for protein NMR (see Table 18.3). A small protein containing about 100 amino acids contains approximately 2,000 hydrogens, 500 carbons, and 130 nitrogens. Each of these nuclei can be unambiguously assigned, providing precious information. The main physical properties obtained from NMR experiments are chemical shift, scalar coupling, and dipolar interaction (from dipolar coupling).

Table 18.3 Physical properties of some magnetically active nuclei commonly used in protein NMR

In practice, proteins prepared for structure determination are enriched with the nuclei presented in Table 18.3. For this purpose, the protein is biosynthesized by bacteria (or other cells) grown in an isotope-labeled medium [20].

Magnetism is a consequence of the spin angular momentum: nuclear magnetism is caused by the nuclear spin. Magnetically active nuclei have a magnetic moment \(\vec{\mu }\) that is associated with the nuclear spin angular momentum \(\vec{I}\). The two vectors are collinear and proportional to each other:

$$\vec{\mu } = \gamma \vec{I}.$$

The proportionality constant is the magnetogyric ratio γ (see Table 18.3). Since \(\vec{I}\) is measured in units of the reduced Planck constant \(\hslash= h/2\pi \), the magnetic moment is proportional to \(\gamma \hslash \).

The nuclear spin angular momentum \(\vec{I}\) has a magnitude given by

$$\vert \vec{I}{\vert }^{2} =\vec{ I} \cdot \vec{ I} = {\hslash }^{2}\left [I(I + 1)\right ],$$

where I is the spin angular momentum quantum number.

The spin is a quantum entity without classical analog. Nevertheless, it is useful to use a semiclassical representation based on classical angular momentum to build up a geometric representation of the spin (see Fig. 18.3).

Fig. 18.3

A schematic representation of the angular momentum of nuclei with nuclear spin angular momentum \(I = 1/2\). The vector \(\vec{I}\), in black, shows the two quantum states, while the vectors in grey represent its projection on the z axis. The projection on z can be determined when there is uncertainty in the projection on the xy plane. The uncertainty is represented by the dotted grey line. It implies that \(\vec{I}\) can be projected in any position of the xy plane

Only one component of the angular momentum \(\vec{I}\), \(I_{x}\), \(I_{y}\), or \(I_{z}\), can be determined simultaneously with its magnitude \(\vert \vec{I}{\vert }^{2}\). By convention, the value of the z component \(I_{z}\) is specified by the equation

$${I}_{z} = \hslash m,$$

where m is the magnetic quantum number that can have the following values:

$$m \in \{-I,-I + 1,-I + 2,\ldots ,I - 2,I - 1,I\}.$$

For a nucleus with \(I = 1/2\), \(\vec{I}\) can adopt two orientations. There is certainty in the projection \(I_{z}\) and uncertainty in \(I_{x}\) and \(I_{y}\). \(I_{z}\) can point either along + z (\(m = 1/2\)) or along − z (\(m = -1/2\)). The magnitudes of \(\vec{I}\) and \(I_{z}\) are

$$\vert \vec{I}\vert= \frac{\hslash \sqrt{3}} {2} ,\quad {I}_{z} = \frac{\hslash } {2},\quad {I}_{z} = -\frac{\hslash } {2}.$$
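The allowed values of m and the magnitude of \(\vec{I}\) follow directly from the spin quantum number I. The short sketch below enumerates them with exact fractions; the function names are hypothetical, and the magnitude is returned in units of ħ.

```python
import math
from fractions import Fraction

def magnetic_quantum_numbers(spin):
    """Return the allowed m values -I, -I + 1, ..., I for a spin quantum number I."""
    spin = Fraction(spin)
    count = int(2 * spin) + 1
    return [-spin + k for k in range(count)]

def spin_magnitude_in_hbar(spin):
    """Return |I| / hbar = sqrt(I * (I + 1))."""
    s = float(Fraction(spin))
    return math.sqrt(s * (s + 1.0))
```

For \(I = 1/2\) this gives the two projections \(m = \pm 1/2\) and a magnitude of \(\sqrt{3}/2\) in units of ħ, in agreement with the expressions above; a nucleus with I = 1 would have three projections.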

The energy of the interaction of the magnetic moment (\(\mu= \gamma \vec{I}\)) in the presence of an external static magnetic field (\(\vec{B}\)) is proportional to the scalar product of μ and \(\vec{B}\):

$$E = -\mu \cdot \vec{ B}.$$

Both are vectorial quantities and the energy is dependent on the relative orientation of these two vectors. Figure 18.4 shows the energy diagram for a spin \(I = 1/2\).

Fig. 18.4

Diagram representing the energy levels of a nuclear spin \(I = 1/2\). Note that, at equilibrium, the high-energy level is less populated than the low-energy one. The arrows represent the projections on z (up and down). The up arrows indicate spins at the lower-energy state (the z-projection is parallel to the main magnetic field) while the down arrows are antiparallel to the static magnetic field (high-energy state)

When the field \(\vec{B}\) is applied along the z direction, the energy becomes

$$E = -\gamma Bo{I}_{z} = -m\gamma \hslash Bo,$$

where Bo is the magnitude of the magnetic field \(\vec{B}\) along the z direction. So, for a spin \(I = 1/2\):

  • The quantum state \(m = 1/2\), which is parallel to Bo, is the minimum energy state (\(\vert \alpha \rangle \) state) with \(E = -\gamma \hslash Bo/2\).

  • The quantum state \(m = -1/2\), which is antiparallel to Bo, is the maximum energy state (\(\vert \beta \rangle \) state) with \(E = \gamma \hslash Bo/2\).

The difference in energy is

$$\Delta E = \hslash \gamma Bo.$$

Note that the energy difference is proportional to Bo. The energy states are degenerate (ΔE = 0) in absence of the magnetic field.

We have so far discussed isolated spins only. For an ensemble of spins, we need to consider the vector sum of the magnetic moments of all spins in the ensemble. In an ensemble, the x and y components of the magnetic moments cancel out. At thermal equilibrium, the lowest-energy state is the most populated one: the populations follow a Boltzmann distribution, and \(\Delta E > 0\) in the presence of a static magnetic field. This gives rise to a macroscopic magnetic component along the z axis, which results from the sum over all spins of the ensemble and is called the magnetization vector \(\vec{M}\) (see Fig. 18.5). Note that \(\vec{M}\) is zero in the absence of an external magnetic field and becomes polarized (\(\vert \vec{M}\vert> 0\)) in the presence of the magnetic field.
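The size of the population difference can be estimated from the Boltzmann distribution and the energy gap \(\Delta E = \hslash \gamma Bo\). The following sketch is illustrative: the function names are hypothetical, the physical constants are standard SI values, and the field (14.1 T) and temperature (298 K) mentioned afterwards are example values that do not come from the text.

```python
import math

HBAR = 1.054571817e-34      # reduced Planck constant, J s
K_B = 1.380649e-23          # Boltzmann constant, J/K
GAMMA_1H = 2.6752218744e8   # 1H magnetogyric ratio, rad s^-1 T^-1

def energy_gap(gamma, b0):
    """Energy difference between the two spin-1/2 states, in joules."""
    return HBAR * gamma * b0

def polarization(gamma, b0, temperature):
    """Relative population excess (N_alpha - N_beta) / (N_alpha + N_beta)."""
    # Boltzmann ratio of the upper to the lower state: N_beta / N_alpha
    ratio = math.exp(-energy_gap(gamma, b0) / (K_B * temperature))
    return (1.0 - ratio) / (1.0 + ratio)
```

Calling `polarization(GAMMA_1H, 14.1, 298.0)` gives a value of the order of a few parts in 100,000, which illustrates why the macroscopic magnetization is small and why the polarization vanishes when the field is zero.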

Fig. 18.5

Effect of a radiofrequency pulse represented by its magnetic component \(\vec{{B}_{1}}\) on the magnetization vector \(\vec{M}\). The figure illustrates the nutation of the \(\vec{M}\) around \(\vec{{B}_{1}}\) at the rotating frame. Since the pulse is applied in x, the nutation occurs in the zy plane. At the laboratory frame \(\vec{{B}_{1}}\) rotates in the xy plane at the frequency of the applied pulse. The rotating frame is a frame of reference that rotates around the z axis at the same frequency of the applied rf pulse (ω0). At the rotating frame \(\vec{{B}_{1}}\) is static

The NMR experiment consists of applying a radiofrequency (rf) pulse with one quantum of energy (\(\Delta E = \hslash \omega= \hslash \gamma Bo\)) and consequently changing the population balance of the energy states. The magnetic component of the radiofrequency pulse \(\vec{{B}_{1}}\) is applied in the xy plane. Figure 18.5 illustrates the magnetic component of the pulse causing the nutation of \(\vec{M}\) in the rotating frame. Nutation consists of the rotation of \(\vec{M}\) around \(\vec{{B}_{1}}\).

The energy of the radiofrequency pulse is

$${E}_{\mathrm{rf}} = \hslash {\omega }_{0}.$$

The resonance condition is

$$\begin{array}{c} {E}_{\mathrm{rf}} = \Delta E \Rightarrow\hslash {\omega }_{0} = \hslash \gamma Bo \Rightarrow{\omega }_{0} = \gamma Bo,\\ \end{array}$$

where ω0 is the Larmor frequency.
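The resonance condition \({\omega }_{0} = \gamma Bo\) can be evaluated numerically. The sketch below converts the angular Larmor frequency into hertz; the magnetogyric ratios are standard literature values (they are not quoted in the text), and the name `larmor_frequency_hz` as well as the 14.1 T field mentioned afterwards are illustrative assumptions.

```python
import math

# Magnetogyric ratios in rad s^-1 T^-1 (standard literature values)
GAMMA = {"1H": 2.6752218744e8, "13C": 6.728284e7}

def larmor_frequency_hz(gamma, b0):
    """Larmor frequency nu0 = gamma * B0 / (2 * pi), in Hz."""
    return gamma * b0 / (2.0 * math.pi)
```

At a field of 14.1 T, `larmor_frequency_hz(GAMMA["1H"], 14.1)` is close to 600 MHz, which is why a magnet of this strength is commonly referred to as a 600 MHz spectrometer; the corresponding 13C frequency is about four times lower.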

The nutation angle of the magnetization is controlled by the rf irradiation time. The spectroscopist calibrates the time necessary for nutating the magnetization by \(90^{\circ}\) (\(M_{z} = 0\), \(M_{xy} = 1\)), by \(180^{\circ}\) (\(M_{z} = -1\), \(M_{xy} = 0\)), or by any other nutation angle. The calibrated pulse width is then used to set up the pulse sequences necessary for collecting the data for structure determination.

After excitation with the rf pulse, the transmitter is turned off. The magnetization is free to evolve back to equilibrium, precessing at the Larmor frequency around Bo. The evolving signal is detected by the receiver and transformed from the time domain to the frequency domain by a Fourier transform, which generates the NMR spectrum. Each spin in the ensemble gives rise to a signal in the spectrum, so that the NMR spectrum contains information on each spin present in the sample (see Fig. 18.6).
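The passage from the time-domain signal to the spectrum can be illustrated with a synthetic example. The sketch below builds a hypothetical signal containing a single resonance at 50 Hz (in the rotating frame) that decays exponentially, and applies a plain discrete Fourier transform; all names and numerical values are illustrative, and real processing software uses fast Fourier transforms together with additional steps that are not shown here.

```python
import cmath
import math

def dft(signal):
    """Plain discrete Fourier transform (O(n^2)), adequate for a short example."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def synthetic_fid(freq_hz, decay_time, dt, n):
    """Complex decaying oscillation sampled at n points spaced by dt seconds."""
    return [cmath.exp(2j * math.pi * freq_hz * k * dt) * math.exp(-k * dt / decay_time)
            for k in range(n)]

DT, N = 1.0 / 512.0, 256                      # sampling interval (s), number of points
fid = synthetic_fid(freq_hz=50.0, decay_time=0.1, dt=DT, n=N)
spectrum = dft(fid)

# The frequency of bin k is k / (N * DT); locate the most intense point.
peak_bin = max(range(N), key=lambda k: abs(spectrum[k]))
peak_hz = peak_bin / (N * DT)
```

The most intense point of the spectrum falls at the frequency of the oscillation in the time-domain signal, and a faster decay would give a broader line.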

Fig. 18.6

Typical NMR spectrum of a protein. Ranges of chemical shifts expected for the various types of 1H resonances

The differences in the electronic density in different molecules, or in different parts of the same molecule, cause the magnetic field to vary on a submolecular distance scale. This effect is called chemical shift and is extremely important for the application of NMR spectroscopy to the study of molecules. In order to understand this effect, it is important to know how the electronic density of a molecule responds to the application of a static field \(\vec{B}\).

As shown in Fig. 18.7, the mechanism that leads to the chemical shift can be simplified as a two-step process:

  1. The external magnetic field induces currents in the electron clouds of the molecule.

  2. These currents induce a magnetic field \(\vec{{B}^{\mathrm{ind}}}\), which adds vectorially to the static field \(\vec{B}\):

    $$\vec{{B}^{\mathrm{loc}}} =\vec{ B} +\vec{ {B}^{\mathrm{ind}}}.$$

Fig. 18.7

A schematic representation of an atom, which illustrates the nucleus and the effect of the rotation of the electrons inducing a magnetic field B ind which is antiparallel to the static magnetic field

Some important properties of \(\vec{{B}^{\mathrm{ind}}}\) follow. First, the induced field is approximately linearly dependent on the applied field. Second, the magnitude and direction of the induced magnetic field depend on the shape of the molecule and on the location of the nuclear spin in the protein. Given these facts, we can write the induced magnetic field as follows:

$$\vec{{B}^{\mathrm{ind}}} = -\sigma\cdot \vec{ B},$$

where σ is called shielding tensor, represented by a 3 ×3 square matrix. Note that σ is not a vector.
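Combining the two expressions above gives a compact form for the local field. The following is a short worked step, assuming that \(\mathbf{1}\) denotes the 3 × 3 identity matrix and that \({\sigma }_{\mathrm{iso}}\) (a symbol not used in the original text) denotes the isotropic average of the shielding tensor, as is appropriate for a molecule tumbling rapidly in solution:

$$\vec{{B}^{\mathrm{loc}}} =\vec{ B} +\vec{ {B}^{\mathrm{ind}}} = (\mathbf{1} - \sigma )\cdot \vec{ B},\qquad {\omega }_{\mathrm{loc}} = \gamma (1 - {\sigma }_{\mathrm{iso}})Bo.$$

Under these assumptions, nuclei in different electronic environments experience slightly different local fields and therefore resonate at slightly different frequencies, which is the origin of the chemical shift.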

6 Experimental Restraints Generated by NMR

The main source of information for protein structure calculation is the nuclear Overhauser effect (NOE). The NOE was first observed by Albert Overhauser in 1953 [54]. As previously observed, ensembles of spins become polarized in the presence of an external magnetic field. When two or more spins are near in space, only a few angstroms apart, they become coupled (dipolar coupling). Under this condition, they can exchange polarization, affecting the intensities of the resonances of each of the spins. The dipolar coupled spins do not relax independently. The polarization transfer occurs via auto-relaxation but also through cross-relaxation.

Cross-relaxation mixes the populations of the two spins. The NOE is used to correlate spins through space [36]. The pulse sequence Nuclear Overhauser Effect SpectroscopY (NOESY) is the most important source of restraints [76]. The cross-peaks in a NOESY spectrum provide distance information between two hydrogens in a protein. The intensity of the NOE cross-peak (\(I_{\mathrm{NOE}}\)) is inversely proportional to the sixth power of the distance between the two hydrogens (the atoms i and j) and depends on the cross-relaxation rate:

$${I}_{\mathrm{NOE}} = \alpha\frac{1} {{\langle {D}_{ij}\rangle }^{6}},$$

where α is the proportionality constant and \(\langle {D}_{ij}\rangle\) is the time-averaged distance between the two hydrogens. Because the intensity drops with the sixth power of the distance, only distances smaller than 6 Å can be measured.

The parameter α contains information on the dynamics of the system (α = f(τ)). τ is the effective correlation time of the nuclei and contains the information about the internal dynamics of each hydrogen, as well as the global dynamics of the protein, such as the overall rotational correlation time. τ cannot be quantitatively treated for each individual hydrogen, and, thus, the NOE information is used in a semiquantitative way. Instead of giving exact distance information, NOEs give ranges of distance, i.e., a lower and upper bound on the actual distance.

There are methods following the local dynamics using a relaxation matrix. These methods provide better-quality distance information, but they still give only time-averaged distances [46].

The step of transforming the NOE intensities into ranges of distances is known as calibration. There are several ways to calibrate NOEs. The most frequent way is to use NOE intensities (or volumes) of hydrogen pairs of known secondary structure elements. The distances of those pairs are indeed well known. One can calculate a certain parameter on the basis of these distances and use the same parameter for all NOEs. This method is the most used for initial protein calculation.
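The calibration step can be sketched in a few lines of Python. This is only an illustration of the relation \({I}_{\mathrm{NOE}} = \alpha /{\langle {D}_{ij}\rangle }^{6}\): the function names, the reference pair (2.5 Å, intensity 100), and the test intensities are hypothetical, and the class limits are the upper bounds quoted in Sect. 18.7.1.

```python
def calibrate_alpha(reference_intensity, reference_distance):
    """Return alpha from one reference pair, using I = alpha / D**6."""
    return reference_intensity * reference_distance ** 6


def intensity_to_distance(intensity, alpha):
    """Invert I = alpha / D**6 to get the target distance D (in angstroms)."""
    return (alpha / intensity) ** (1.0 / 6.0)


def upper_bound_class(distance,
                      classes=((2.8, "strong"), (3.4, "medium"), (6.0, "weak"))):
    """Map a target distance onto the smallest upper bound that contains it."""
    for upper, label in classes:
        if distance <= upper:
            return upper, label
    return None, "unobservable"  # beyond the range an NOE can report


# Hypothetical reference: a pair at a known 2.5 A whose cross-peak intensity is 100.
alpha = calibrate_alpha(100.0, 2.5)
for intensity in (100.0, 20.0, 1.0):
    d = intensity_to_distance(intensity, alpha)
    print(round(d, 2), upper_bound_class(d))
```

Because of the sixth root, a 100-fold drop in intensity only roughly doubles the target distance, which is one reason why NOEs are usually binned into a few classes rather than treated as exact distances.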

A different NOE calibration method can be used during refinements. At this stage of protein calculations, the structure is already known. Thus, the distances extracted from the structures can be used for NOE calibration.

6.1 Scalar Coupling (J)

The other source of information in the NMR experiments is the scalar couplings (J). Unlike the dipolar coupling, which occurs through space, the scalar coupling occurs through bonds. J coupling can be through one, two, or three bonds (\({}^{1}J\), \({}^{2}J\), \({}^{3}J\)). One-bond J couplings are typically heteronuclear, such as the coupling between the amide nitrogen and hydrogen (\({}^{1}{J}_{{}^{15}\mathrm{N}{-}^{1}\mathrm{H}}\)). Two-bond J coupling occurs between geminal hydrogens, such as those of a CH2 group.

Finally, three-bond J couplings are the most important for structural information. Their value gives information about dihedral angles. For instance, the coupling between the amide hydrogen and the alpha hydrogen (\({}^{3}{J}_{{\mathrm{H}}_{N}-{\mathrm{H}}_{\alpha }}\)) depends on the Φ angle of the Ramachandran plot. Figure 18.8 shows the Karplus relation [33] for the dependence of \({}^{3}{J}_{{\mathrm{H}}_{N}-{\mathrm{H}}_{\alpha }}\) on Φ. There are several NMR experiments designed to measure several dihedrals of a protein.

Fig. 18.8

Karplus plot of \({}^{3}{J}_{{\mathrm{H}}_{N}-{\mathrm{H}}_{\alpha }}\) (in Hz) versus the torsion angle Φ. The grey solid curve is the best fit of equation parameters (top of the figure) where \(\theta= \vert \Phi- 60\vert \). Values for regular secondary structures are indicated for α-helix (circle at −57°, 3.9 Hz), \(3_{10}\) helix (inverted triangle at −60°, 4.2 Hz), antiparallel β-sheet (square at −139°, 8.9 Hz), and parallel β-sheet (triangle at −119°, 9.7 Hz) [55]. The region on the left delimited by the dotted green line (\(-3{0}^{\,\circ },-18{0}^{\,\circ }\)) concentrates the dihedral angles (Φ) of all amino acids (with the exception of glycines) [75]

6.2 Chemical Shift

As previously shown, chemical shifts depend on the microenvironment and are very sensitive to small changes. A correlation has been established between the secondary structure and the chemical shifts of the alpha hydrogen (Hα), alpha carbon (13Cα), beta carbon (13Cβ), and carbonyl carbon (13C′). This is very important structural information because, once the resonances of a protein have been assigned, its secondary structure elements can be determined straightforwardly from chemical shifts alone. Table 18.4 summarizes the correlation between the chemical shift of each of these nuclei and the secondary structure.

Table 18.4 The correlation between chemical shifts and secondary structures of proteins

6.3 Residual Dipolar Couplings

As previously observed, the dipolar coupling is responsible for the mechanism of polarization transfer through cross-relaxation, which leads to the NOEs. However, dipolar couplings cannot be measured directly in solution NMR spectra because isotropic molecular tumbling averages them to zero.

In the 1990s, Prestegard and collaborators solubilized proteins in anisotropic media and showed that the residual orientation of the protein made it possible to recover dipolar coupling information. Anisotropic media consist of colloidal phases, such as bicelles and liquid crystals, or bacteriophages, such as Pf1, which are spontaneously oriented in the magnetic field. They restrict the Brownian motion of proteins in a way that induces a residual orientation due to the intrinsic anisotropic shape of the protein (see Fig. 18.9).

Fig. 18.9

(a) A protein structure (yeast thioredoxin, PDB id: 2I9H) showing the calculated components of the molecular alignment tensor, \({A}_{xx}\), \({A}_{yy}\), and \({A}_{zz}\), as well as the representation of a dipolar vector (the NH vector in this case). By definition, \({A}_{zz} > {A}_{yy} > {A}_{xx}\). The principal component of the molecular alignment tensor is therefore \({A}_{zz}\). (b) Representation of the dipolar vector (the NH vector) in the molecular orientation frame of reference

The proteins still tumble fast in solution, which preserves the sharp lines that solution NMR requires. The residual orientation induces the reappearance of the dipolar coupling in solution. The residual dipolar coupling constant depends on the degree of orientation of the protein in the anisotropic media. The spectroscopist is able to tune the line shape and the degree of orientation by changing the concentration and other properties of the anisotropic media.

Dipolar coupling depends on the angle between the dipolar vector and the main static magnetic field. This is true for a static oriented sample. Proteins dissolved in anisotropic media are not static. In this case, the residual dipolar coupling (RDC) does not depend directly on the angle between the dipolar vector and the static magnetic field; instead, RDCs measure the angle between the dipolar vector and the principal molecular alignment tensor.

The principal molecular alignment tensors can be measured experimentally and also calculated from the molecule shape (Fig. 18.9). Thus, RDCs can be considered as an experimental restraint. This is a good quality restraint because it is a long-range angular restraint. RDCs have been used extensively as a refinement tool and their use allows for improving the geometric quality of the structures [44].

7 Experimental Pseudo-potentials

We introduce in this section some experimental pseudo-potentials based on the information obtained by NMR experiments. We describe NOEs as distance restraints, scalar coupling and chemical shifts as short-range angular restraints (proper dihedrals), and RDCs as long-range angular restraints. There are other sources of restraints that we do not discuss here: paramagnetic restraints, which are long-range distance restraints [15], and chemical shift anisotropy restraints [42, 43, 74], among others.

The general strategy is to transform the experimental information into pseudo-potentials that can be used in the structural calculations. Next, we describe a pseudo-potential for each type of experimental information.

7.1 NOEs: Distance Restraints

After NOE calibration, the list of NOEs serves as an input for structural calculation. The NOE assignment list contains the specification of the hydrogen pair and the distance information, given as a lower bound (\({L}_{ij}\)) and an upper bound (\({U}_{ij}\)) on the distance. The lower bound is approximately 1.8 Å, which is the shortest possible distance between two hydrogens according to their atomic VdW radii. The upper bound depends on the target distance calculated from NOE calibration. Typically the distance restraints are assigned in classes: weak (\({U}_{ij} = 6\) Å), medium (\({U}_{ij} = 3.4\) Å), and strong (\({U}_{ij} = 2.8\) Å). The interval for each class is somewhat arbitrary and can vary from author to author.

Quadratic Pseudo-potential The pseudo-potential for NOE can be defined as follows. It gives no energy penalty when the distance r between the two hydrogens (i and j) is contained in the interval \([{L}_{ij},{U}_{ij}]\). The potential increases quadratically when r does not belong to the given interval:

$${V}_{ij} = \begin{cases} {C}_{1}{(r - {L}_{ij})}^{2}, & \mathrm{if}\ \ r < {L}_{ij} \\ 0, & \mathrm{if}\ \ {L}_{ij} \leq r \leq {U}_{ij} \\ {C}_{2}{(r - {U}_{ij})}^{2}, & \mathrm{if}\ \ r > {U}_{ij}, \end{cases}$$
(18.3)

where \({C}_{1}\) and \({C}_{2}\) are force constants that control the steepness of the energy pseudo-potential.

Biharmonic Pseudo-potential The pseudo-potential for NOE can also be defined as a function of a unique target distance \({D}_{ij}\) that can be calibrated from NOE intensities. In this case, the pseudo-potential is defined as follows:

$${V}_{ij} = \begin{cases} {C}_{1}{(r - {D}_{ij})}^{2}, & \mathrm{if}\ \ r > {D}_{ij} \\ {C}_{2}{(r - {D}_{ij})}^{2}, & \mathrm{if}\ \ r < {D}_{ij}, \end{cases}$$

where \({C}_{1}\) and \({C}_{2}\) are force constants that are weighted by the thermal energy (\({K}_{\mathrm{b}}T\)) available in the computational system:

$${C}_{1} = {S}_{1}\frac{{K}_{\mathrm{b}}T}{2} \quad \mathrm{and}\quad {C}_{2} = {S}_{2}\frac{{K}_{\mathrm{b}}T}{2},$$

where \({K}_{\mathrm{b}}\) is the Boltzmann constant, T is the absolute temperature of the system, and \({S}_{1}\) and \({S}_{2}\) are scale factors. Note that, unlike the quadratic pseudo-potential, this potential is zero only at \(r = {D}_{ij}\); it is not zero when r is merely within the interval defined by a lower and an upper bound.
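The two NOE pseudo-potentials can be compared with a short Python sketch. The function names and default values are hypothetical, and the value of the Boltzmann constant is given in kcal/(mol K), a unit choice that is an assumption of this example rather than something stated in the text.

```python
def square_well(r, lower, upper, c1=1.0, c2=1.0):
    """Quadratic pseudo-potential, Eq. (18.3): zero inside [lower, upper]."""
    if r < lower:
        return c1 * (r - lower) ** 2
    if r > upper:
        return c2 * (r - upper) ** 2
    return 0.0


K_B = 0.0019872  # Boltzmann constant in kcal/(mol K); unit choice is an assumption


def biharmonic(r, target, s1=1.0, s2=1.0, temperature=300.0):
    """Biharmonic pseudo-potential: harmonic on both sides of one target distance."""
    c1 = s1 * K_B * temperature / 2.0  # force constant used when r > target
    c2 = s2 * K_B * temperature / 2.0  # force constant used when r < target
    c = c1 if r > target else c2
    return c * (r - target) ** 2


# A medium-class restraint (1.8 to 3.4 A) versus a single 3.0 A target distance.
for r in (1.5, 2.5, 3.0, 4.0):
    print(r, square_well(r, 1.8, 3.4), biharmonic(r, 3.0))
```

At r = 2.5 Å the quadratic form gives no penalty while the biharmonic form already does, which illustrates the remark that the biharmonic potential vanishes only at the target distance.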

7.2 Dihedral Restraints

Dihedral restraints can be incorporated in the structural calculation. They are obtained from scalar coupling measurements and chemical shift information. For each dihedral restraint, we have the target dihedral \({\theta }_{\mathrm{target}}\) and the permitted variation Δθ, which is usually relatively large. This way, it allows the dihedral conformational space to vary freely within the low-energy Ramachandran area.

The pseudo-potential for a dihedral angle is defined as follows:

$${V}_{\mathrm{dihedral}} = \begin{cases} {C}_{1}{(\theta - {\theta }_{\mathrm{target}})}^{2}, & \mathrm{if}\ \ \theta < {\theta }_{\mathrm{target}} - \Delta \theta \\ 0, & \mathrm{if}\ \ {\theta }_{\mathrm{target}} - \Delta \theta \leq \theta \leq {\theta }_{\mathrm{target}} + \Delta \theta \\ {C}_{2}{(\theta - {\theta }_{\mathrm{target}})}^{2}, & \mathrm{if}\ \ \theta > {\theta }_{\mathrm{target}} + \Delta \theta , \end{cases}$$

where \({C}_{1}\) and \({C}_{2}\) are the two force constants.

7.3 Scalar J-Coupling Restraints

The pseudo-potential energy term for scalar coupling makes use of the Karplus relation. This equation uses the dihedral angle θ obtained at each time step of the structure calculation to obtain the calculated scalar coupling (\({J}_{\mathrm{calculated}}\)):

$$J = A{\cos }^{2}(\theta + P) + B\cos (\theta + P) + C,$$

where A, B, and C are the Karplus coefficients and P is a phase. It then uses \({J}_{\mathrm{calculated}}\) to create a pseudo-potential \({V}_{J}\) by comparing it to the experimental J coupling (\({J}_{\mathrm{observed}}\)). The pseudo-potential is defined as follows:

$${V}_{J} = {K}_{J}{({J}_{\mathrm{calculated}} - {J}_{\mathrm{observed}})}^{2},$$

where \({K}_{J}\) is the force constant (not to be confused with the Karplus coefficient C).
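As an illustration, the Karplus relation and the resulting pseudo-potential can be written in a few lines of Python. The coefficients A = 6.4, B = −1.4, C = 1.9 Hz and the phase P = −60° are a commonly quoted parameter set for \({}^{3}{J}_{{\mathrm{H}}_{N}-{\mathrm{H}}_{\alpha }}\); they are an assumption of this sketch and are not taken from this chapter, although they reproduce the values quoted in the caption of Fig. 18.8 reasonably well. The function names are hypothetical.

```python
import math

# Assumed Karplus coefficients (Hz) and phase (degrees) for 3J(HN-Halpha).
A, B, C = 6.4, -1.4, 1.9
PHASE_DEG = -60.0


def karplus_j(phi_deg, a=A, b=B, c=C, phase_deg=PHASE_DEG):
    """Calculated coupling J = A cos^2(phi + P) + B cos(phi + P) + C."""
    x = math.cos(math.radians(phi_deg + phase_deg))
    return a * x * x + b * x + c


def j_potential(phi_deg, j_observed, force_constant=1.0):
    """Harmonic penalty on the difference between calculated and observed J."""
    return force_constant * (karplus_j(phi_deg) - j_observed) ** 2


for name, phi in (("alpha-helix", -57.0), ("antiparallel beta-sheet", -139.0)):
    print(name, round(karplus_j(phi), 1), "Hz")
```

Note that the Karplus curve is not one-to-one: several values of Φ can give the same J. Restraining J directly, as this pseudo-potential does, avoids having to choose one of them beforehand.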

7.4 Chemical Shift Restraints

1H and 13C chemical shifts correlate with the angles Φ and Ψ and can define secondary structure elements. Several implementations of protein structural calculation include harmonic potentials for chemical shifts. The Xplor-NIH package for protein structural calculation [64] includes pseudo-potentials for Cα and Cβ chemical shifts [37]. It also includes pseudo-potentials for non-exchangeable hydrogens. Chemical shifts are calculated on the basis of semiempirical methods, where random coil values, ring currents, magnetic anisotropy, and electric-field chemical shifts are considered. The experimental chemical shift is compared to the one predicted from the structure, and the pseudo-potential drives the refinement of the structure toward agreement with the chemical shifts [37, 38].

The most used strategy to take chemical shifts into account is through the prediction of the Φ and Ψ dihedral angles. The program TALOS [66] uses a combination of six kinds of chemical shift information: \({\delta }_{{\mathrm{H}}_{\mathrm{N}}}\), \({\delta }_{{\mathrm{H}}_{\alpha }}\), \({\delta }_{{\mathrm{C}}_{\alpha }}\), \({\delta }_{{\mathrm{C}}_{\beta }}\), \({\delta }_{{\mathrm{C}}^{{\prime}}}\), and \({\delta }_{\mathrm{N}}\). The program is based on a search of a database of 200 high-resolution protein structures, with sequence information, Φ and Ψ torsion angles, and chemical shift assignments. It looks for chemical shift similarities between a certain residue and the two adjacent residues (triplets of residues). It always uses triplets of residues to predict the backbone torsion angles of a given residue. If there is a consensus of Φ and Ψ angles among the ten best database matches, then TALOS uses these database triplet structures to form a prediction for the backbone angles of the target residue.

Based on the matches, TALOS calculates a consensus for the Φ and Ψ angles (\({\Phi }_{\mathrm{target}}\) and \({\Psi }_{\mathrm{target}}\)). The values of \({\Phi }_{\mathrm{target}}\) and ΔΦ and of \({\Psi }_{\mathrm{target}}\) and ΔΨ are included as dihedral angle restraints. The accuracy of TALOS predictions is about 89%. Most of the errors occur in regions of the Ramachandran plot that do not define secondary structure elements. TALOS predictions can thus be used reliably for secondary structure elements.

7.5 Residual Dipolar Coupling Restraints

As observed before, partial orientation of macromolecules in anisotropic media allows the detection of RDCs. RDCs are good quality restraints because they define angles between a bond vector and the principal molecular alignment tensor (see Fig. 18.9). In order to compute RDCs, it is necessary to use an external orientational axis that is the reference for the angle measurement between the bond vectors. The implementations of RDC pseudo-potentials in the program Xplor-NIH can take into account dipolar vectors between atoms that are directly bonded (such as N–H or C–H bonds), or more flexible situations where the dipolar vector is between atoms not directly bonded, such as 1H–1H dipolar couplings. 1H–1H dipolar couplings are more difficult to use since 1H–1H distances can vary. In this chapter, we describe only the directly bonded RDCs. For more detailed information on other implementations, the reader is referred to [1, 10–12, 49, 63, 70, 71].

A necessary step is to calculate, from the structure, the rhombicity and the amplitude of the molecular alignment tensor. This is accomplished from the shape of the molecule. The molecular alignment tensors from experimental RDC are obtained from the following equation:

$$\mathrm{RDC}(\theta ,\Phi ) = {A}_{\mathrm{a}}\left \{(3{\cos }^{2}\theta - 1) + \frac{3}{2}R\,{\sin }^{2}\theta \cos 2\Phi \right \},$$

where θ and Φ are the polar angles of the dipolar vector in the molecular frame of reference (see Fig. 18.9), and the axial component \({A}_{\mathrm{a}}\), the radial component \({A}_{\mathrm{r}}\), and the rhombicity R are defined as follows:

$${A}_{\mathrm{a}} = \frac{1}{3}\left \{{A}_{zz} - \frac{{A}_{yy} + {A}_{xx}}{2}\right \},\quad {A}_{\mathrm{r}} = \frac{{A}_{xx} - {A}_{yy}}{3},\quad R = \frac{{A}_{\mathrm{r}}}{{A}_{\mathrm{a}}}.$$

The pseudo-potential is defined as a quadratic harmonic potential:

$${V }_{\mathrm{RDC}} = {K}_{\mathrm{RDC}}{({\mathrm{RDC}}_{\mathrm{calculated}} -{\mathrm{RDC}}_{\mathrm{observed}})}^{2}.$$

More frequently, θ, the angle between the internuclear dipolar vector and the reference external vector, which represents \({A}_{zz}\) in the calculation, is obtained with good precision. The rhombic component is usually not precise enough to be used in the calculation. Thus, in practice, RDCs are able to define a cone with angle ± θ around the principal component of the molecular axis. Of course, the lack of precision in Φ limits the restraining ability of RDCs.
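The RDC expression and its pseudo-potential can be evaluated with the following Python sketch. The function names and the numerical values are hypothetical, and the rhombic term is taken here as \({\sin }^{2}\theta \cos 2\Phi \), the form commonly used in the RDC literature; this choice is an assumption of the example.

```python
import math


def rdc(theta_deg, phi_deg, a_axial, rhombicity):
    """RDC from the polar angles of the dipolar vector in the alignment frame."""
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    axial = 3.0 * math.cos(theta) ** 2 - 1.0
    # Assumed form of the rhombic term: sin^2(theta) * cos(2 * phi).
    rhombic = 1.5 * rhombicity * math.sin(theta) ** 2 * math.cos(2.0 * phi)
    return a_axial * (axial + rhombic)


def rdc_potential(rdc_calculated, rdc_observed, k_rdc=1.0):
    """Quadratic harmonic penalty on the RDC deviation."""
    return k_rdc * (rdc_calculated - rdc_observed) ** 2


# Axially symmetric case (R = 0): the RDC depends on theta only.
for theta in (0.0, 30.0, 90.0):
    print(theta, round(rdc(theta, 0.0, a_axial=10.0, rhombicity=0.0), 2))
```

With R = 0 the values θ and 180° − θ give the same RDC for any Φ, which is the cone of allowed orientations mentioned in the text.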

So far, we provided a description of pseudo-potentials which are based on experimental restraints obtained by NMR. In the next sections, we describe some computational solutions for calculating protein structures by using the NMR experimental information.

8 Distance Geometry Methods

The most important aspect of protein structure determination by NMR is the exploration of the conformational space imposed by the experimental restraints. X-ray diffraction of a single crystal generates an electron density map, which directly provides structural information. In contrast, NMR experimental restraints do not give structural information directly, but rather short-range distance and dihedral angle restraints. The result of such a calculation is not a single structure, as for X-ray diffraction, but a set of structures that are all able to satisfy the experimental restraints.

As discussed earlier, NMR experimental restraints consist of semi-quantitative short-range distances and angles information. The structural calculation uses ranges of distances and angles, rather than precise measurements. NMR distance and angle restraints provide upper and lower bounds for both distances and angles.

Ideally, the measurement of precise long-range (in the order of the radius of gyration) distances or angles generates higher-quality restraints. However, this kind of restraint is difficult to measure by NMR. RDCs are better-quality restraints because they give information about long-range angles, but their application is restricted. In fact, only θ angles can be measured with precision. Nevertheless, the inclusion of RDCs in the structure calculation has a dramatic effect on the geometric quality [68]. Recent advances in solid state NMR and paramagnetic relaxation enhancement experiments (PRE) in solution introduced some better-quality long-range distance restraints [21, 31, 40].

What makes structure determination by NMR possible is the fact that the number of short-range distance restraints is generally much larger than the degrees of freedom. There are two degrees of freedom per amino acid residue in the protein backbone (Φ and Ψ dihedral angles), and, typically, good NMR experiments are able to provide more than 15 short-range restraints per amino acid residue.

NMR structure determination is not a computationally simple problem. The lack of precise distances and angles prevents a solution by fast geometric algorithms [35]. The computational solution was the inclusion of an all-atom model with all the known protein geometric angle and distance information along with the semiquantitative short-range experimental information. This approach made it possible to obtain the structures of globular proteins.

In the following, we briefly introduce the computational tools that have been specifically conceived to tackle the problem of exploring the whole conformational space imposed by the imprecise experimental restraints.

The most naive way to explore the whole conformational space is to build a systematic grid of potential conformations and exhaustively explore it. However, this method can be applied only to small peptides [67]. We return to this idea later in the context of torsion angle simulated annealing.

The problem of finding the structure of a molecule from some distance and angle restraints is known in the scientific literature as the (molecular) distance geometry problem. Many methods and algorithms have been developed over the past years for an efficient solution of this problem. The first method for distance geometry dates back to the 1970s. The basic idea is to define a penalty function which is able to measure the satisfaction of the available restraints, and to optimize this penalty function. One of the advantages is that the minimum value of the penalty function (corresponding to the optimal structure satisfying all restraints) is known a priori, because, when the data are correct, it must ideally be zero. If there is no geometric solution with error near zero, this is strong evidence of systematic errors in the experimental data [26, 27].

The first method for distance geometry makes use of the metric matrix \(\vec{G}\), from which it is possible to obtain the Cartesian coordinates of the atoms of the molecule by exploiting the available set of distances between some pairs of atoms. The relation between the elements G ij of the metric matrix \(\vec{G}\) and the Cartesian coordinates of the two atoms i and j is given by

$${G}_{ij} =\vec{ {r}_{i}} \cdot \vec{ {r}_{j}}.$$
(18.4)

In the matrix \(\vec{G}\), each diagonal element \({G}_{ii}\) is the squared norm of the vector \(\vec{{r}_{i}}\), i.e., the squared distance of the atom i from the origin (0, 0, 0), whereas each off-diagonal element \({G}_{ij}\) is the dot product of \(\vec{{r}_{i}}\) and \(\vec{{r}_{j}}\), i.e., the projection of \(\vec{{r}_{i}}\) over \(\vec{{r}_{j}}\) multiplied by the norm of \(\vec{{r}_{j}}\). The vector \(\vec{{r}_{i}}\) defines the position of each atom in relation to the origin.

As it is well known, the dot product can be written as

$${G}_{ij} = \vert {r}_{i}\vert \vert {r}_{j}\vert \cos \theta ,$$

where θ is the angle between the two vectors. Such an angle is 0 for diagonal elements and, in general, nonzero for off-diagonal elements.

The metric matrix \(\vec{G}\) is built by considering all N × N possible distances for the set of N atoms. The elements of the metric matrix are obtained through the relations

$${G}_{ii} = \frac{1} {N}\sum\limits_{j}^{N}{D}_{ ij}^{2} - \frac{1} {2{N}^{2}}\sum\limits_{jk}^{N}{D}_{ jk}^{2},\quad {G}_{ ij} = \frac{1} {2}\left ({G}_{ii} + {G}_{jj} - {D}_{ij}^{2}\right ),$$

where \({D}_{ij}\) is the distance between the atoms i and j, and N is the total number of atoms. When the distances are consistent with a three-dimensional structure, the metric matrix is positive semidefinite and has rank at most 3: all eigenvalues are positive or zero, and at most three eigenvalues are different from zero.

The general metric matrix decomposition equation is used for the diagonalization, which is necessary to find the coordinates of each atom:

$${G}_{ij} =\sum\limits_{\alpha =1}^{n}{\lambda }_{ \alpha }{E}_{i}^{\alpha }{E}_{ j}^{\alpha }.$$
(18.5)

\({E}_{i}^{\alpha }\) and \({E}_{j}^{\alpha }\) are the components i and j of the eigenvector associated with the eigenvalue λα of the matrix; n is the dimensionality of the system.

The combination of Eqs. (18.4) and (18.5) leads to the following equation, which enables the calculation of the three-dimensional coordinates of the points of the system from the metric matrix elements:

$${r}_{i}^{\alpha } = \sqrt{{\lambda }_{ \alpha }}{E}_{i}^{\alpha }.$$

The equations implicitly assume that every position vector is referenced to a common origin (0, 0, 0). With the relations given above for \({G}_{ii}\), this origin is the centroid of the atoms; alternatively, one of the atoms, say the one labeled with 1, can be set to the origin.
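The construction of the metric matrix and the recovery of coordinates from it can be checked numerically. The sketch below assumes NumPy is available and that a complete matrix of exact distances is given; the function names and the four-point test structure are hypothetical. The relations used for \({G}_{ii}\) place the origin at the centroid of the points.

```python
import numpy as np


def metric_matrix(distances):
    """Metric matrix G from a complete, symmetric matrix of exact distances."""
    d2 = np.asarray(distances, dtype=float) ** 2
    n = d2.shape[0]
    # G_ii = (1/N) sum_j D_ij^2 - (1/(2 N^2)) sum_jk D_jk^2
    g_diag = d2.sum(axis=1) / n - d2.sum() / (2.0 * n * n)
    # G_ij = (G_ii + G_jj - D_ij^2) / 2
    return 0.5 * (g_diag[:, None] + g_diag[None, :] - d2)


def embed(distances, dim=3):
    """Coordinates r_i^alpha = sqrt(lambda_alpha) * E_i^alpha."""
    eigenvalues, eigenvectors = np.linalg.eigh(metric_matrix(distances))
    order = np.argsort(eigenvalues)[::-1][:dim]   # indices of the largest eigenvalues
    lam = np.clip(eigenvalues[order], 0.0, None)  # guard against round-off below zero
    return eigenvectors[:, order] * np.sqrt(lam)


# Hypothetical test structure: four points forming a right-angled corner.
points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = embed(dist)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(recovered, dist))
```

The embedded coordinates generally differ from the original ones by a rotation, a reflection, and a translation, since distances alone cannot fix these; only the interatomic distances are compared here. The possible reflection is the reason why chirality has to be checked separately.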

As discussed before, the distance information is generally given by a list of lower and upper bounds:

$${L}_{ij} < {D}_{ij} < {U}_{ij}.$$

The basic steps of the first method for distance geometry are [25]:

  1. Bound smoothing: extrapolates the tightest possible bounds on the incomplete list of interatomic distances

  2. Metrization: tries to find a matrix of exact values within the lower and upper bounds

  3. Embedding: computes the coordinates of all atoms of the protein

  4. Optimization: minimizes the penalty function value, i.e., the measure of the violation of both lower and upper bounds on the distances, where some geometric constraints of proteins are also considered

We give the details of these four main steps in the following.

8.1 Bound Smoothing

Metric matrix distance geometry algorithms work with exact distances (derived from bond lengths and angles) and NMR experimental data, which are non-exact distances. In the first implementation of algorithms for distance geometry, the distances were chosen independently and randomly within the available lower and upper bounds.

Subsequently, bound smoothing was developed for choosing better distances. The technique is based on the fact that interatomic distances always obey triangle inequalities. In fact, the triangle inequality theorem states that no side of a triangle is longer than the sum of the two other sides. For a triplet of atoms (i, j, k), it follows that

$${L}_{ik} - {U}_{kj} \leq{D}_{ij} \leq{U}_{ik} + {U}_{kj}.$$

Note that the triangle inequality imposes some constraints on \({D}_{ij}\). Many algorithms for distance geometry consider these inequalities for all possible triplets (i, j, k) in order to obtain the so-called triangle inequality bounds.

Another relation that could be used for bound smoothing is given by the tetrangle inequalities. The tetrangle inequality is similar to the triangle inequality, but it considers quadruplets of atoms, not triplets. It is able, in general, to provide tighter bounds on D ij , but it is much more expensive from a computational point of view.
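Triangle inequality bound smoothing can be sketched as follows. This is a plain-Python illustration and not the implementation of any specific program: upper bounds are tightened with a shortest-path sweep (Floyd-Warshall style), and lower bounds are then raised with \({L}_{ij} \geq {L}_{ik} - {U}_{kj}\) until nothing changes. The function name and the three-atom example are hypothetical; tetrangle inequalities are not considered.

```python
def triangle_smooth(lower, upper):
    """Return tightened copies of symmetric lower/upper bound matrices."""
    n = len(upper)
    lo = [row[:] for row in lower]
    up = [row[:] for row in upper]
    # Upper bounds: U_ij <= U_ik + U_kj for every intermediate atom k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if up[i][k] + up[k][j] < up[i][j]:
                    up[i][j] = up[i][k] + up[k][j]
    # Lower bounds: L_ij >= L_ik - U_kj, repeated until no bound changes.
    changed = True
    while changed:
        changed = False
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    candidate = lo[i][k] - up[k][j]
                    if candidate > lo[i][j]:
                        lo[i][j] = lo[j][i] = candidate
                        changed = True
    for i in range(n):
        for j in range(n):
            if lo[i][j] > up[i][j]:
                raise ValueError("inconsistent bounds for pair (%d, %d)" % (i, j))
    return lo, up


# Three atoms: pairs (0, 1) and (1, 2) are restrained, pair (0, 2) is unknown.
lower = [[0.0, 1.0, 0.0], [1.0, 0.0, 5.0], [0.0, 5.0, 0.0]]
upper = [[0.0, 1.5, 99.0], [1.5, 0.0, 6.0], [99.0, 6.0, 0.0]]
lo, up = triangle_smooth(lower, upper)
print(lo[0][2], up[0][2])
```

In this example the unknown pair (0, 2) inherits the bounds 3.5 and 7.5 from the two restrained pairs, so a random choice for its distance is made in a much narrower interval than the original one.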

8.2 Metrization

The metrization procedure can be used to improve the geometrical consistency of the randomly chosen distances. We suppose that all distances were chosen from bounds previously processed by a bound smoothing technique (based on triangle and/or tetrangle inequalities). The metrization is based on the construction of distance matrices whose elements respect two rules:

  1. Their lower and upper bounds satisfy the triangle and the tetrangle inequalities.

  2. The chosen distances satisfy the triangle inequality.

The second rule ensures that later interatomic distance choices are consistent with earlier ones. The metrization imposes interdependency between the randomly chosen distances (they are, in fact, not completely independent of each other).

8.3 Embedding

Each initial distance is chosen as an exact value contained in the interval defined by the corresponding lower and upper bounds. The metric matrix is calculated, and it frequently results in a matrix that is not embeddable in three-dimensional space. This means that the matrix is not positive semidefinite with rank at most 3, i.e., the solution is inconsistent with any conformation in three-dimensional space.

The main aim is to identify an embeddable metric matrix in three dimensions. Within the bound distances, there is a metric matrix in which the three largest eigenvalues are positive, and their corresponding eigenvectors, scaled by the square roots of the eigenvalues, contain the Cartesian coordinates of the atoms of the molecule. If these values are not positive, the chosen distances are not consistent, and the embedding cannot be performed.

8.4 Optimization

This step consists in improving the quality of the protein structure found during the embedding. To this aim, a penalty function (measuring the violations of lower and upper bounds, as well as some geometrical deviations) is defined and optimized. This penalty function must obey the following rules:

  1. Must be nonnegative

  2. Must be zero when all the geometric constraints are satisfied

  3. Must be twice differentiable in its whole domain

An example of penalty function is

$$F(x) =\sum\limits_{ij}{A}_{ij}^{2}(x) +\sum\limits_{ij}{B}_{ij}^{2}(x) +\sum\limits_{ijkm}{C}_{ijkm}^{2}(x),$$

where:

  • \({A}_{ij}^{2}(x) = 0\) if and only if the distance between the nonbonded pair of atoms (i, j) is larger than the sum of their hard-sphere VdW radii.

  • \({B}_{ij}^{2}(x) = 0\) if and only if the distance between the pair of atoms (i, j) restrained by experimental data lies within the corresponding lower and upper bound.

  • \({C}_{ijkm}^{2}(x) = 0\) if and only if the quadruplet of atoms (i, j, k, m) respects the absolute chirality.

In order to minimize the penalty function, a conjugate gradient minimization method can be used. Different penalty functions have been defined in different distance geometry approaches [25].
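A penalty function of this kind can be sketched in Python as follows. Only the A (VdW contact) and B (experimental bounds) terms are included and the chirality term C is omitted. Each term is the square of a squared hinge on squared distances, so that the function is nonnegative, vanishes when all conditions are satisfied, and is twice differentiable; this particular functional form, like the function names and the example values, is an assumption of the sketch and not the penalty function of any specific program.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def hinge(x):
    """Zero when x <= 0, x**2 when the condition is violated."""
    return max(0.0, x) ** 2


def penalty(coords, vdw_pairs, restraints):
    """F(x) restricted to the A and B terms (chirality term omitted)."""
    total = 0.0
    for i, j, contact in vdw_pairs:        # A terms: atoms closer than `contact`
        d2 = squared_distance(coords[i], coords[j])
        total += hinge(contact ** 2 - d2) ** 2
    for i, j, lower, upper in restraints:  # B terms: experimental distance bounds
        d2 = squared_distance(coords[i], coords[j])
        total += hinge(lower ** 2 - d2) ** 2 + hinge(d2 - upper ** 2) ** 2
    return total


# Two hydrogens with a hypothetical contact distance of 2.0 A and bounds 1.8 to 3.4 A.
vdw_pairs = [(0, 1, 2.0)]
restraints = [(0, 1, 1.8, 3.4)]
print(penalty([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)], vdw_pairs, restraints))
print(penalty([(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)], vdw_pairs, restraints))
```

The first conformation satisfies all conditions and has zero penalty; the second violates the upper bound and is penalized. Such a function can then be passed to a gradient-based minimizer.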

8.5 Scaling

At the very end, the obtained protein structure can be scaled so that it represents a globular protein. To this purpose, the expected radius of gyration of the structure is calculated. This expected radius can be larger or smaller than the radius of gyration calculated from the embedded coordinates. Therefore, a scaling factor equal to the ratio between the expected and the actual radius of gyration is computed. The embedded coordinates are then multiplied by this factor, because it makes any subsequent regularization easier to perform.
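The scaling step amounts to a few lines of Python. How the expected radius of gyration is estimated is not specified here, so it is simply passed as an input; the function names and the two-point example are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid (equal masses)."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)


def scale_to_expected(coords, expected_rg):
    """Multiply all coordinates by expected_rg / actual_rg."""
    factor = expected_rg / radius_of_gyration(coords)
    return [tuple(factor * x for x in c) for c in coords]


embedded = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
scaled = scale_to_expected(embedded, expected_rg=1.5)
print(radius_of_gyration(embedded), radius_of_gyration(scaled))
```

A uniform scaling changes every interatomic distance by the same factor, so the overall fold is preserved while the size of the structure is adjusted.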

9 Simulated Annealing

9.1 SA in Cartesian Space

As discussed in the previous section, the first method for distance geometry problems arising in the molecular context makes use of conjugate gradient minimization of a given penalty function. We remark that such penalty functions do not consider many molecular forces that are instead used in molecular dynamics simulations. As a consequence, structures obtained by this method can produce correct overall folds, but they have poor local geometry. It was then realized that these structures are a very good input for restrained molecular dynamics simulations.

The first approach using restrained molecular dynamics simulation was employed to refine structures calculated from metric matrix distance geometry. The group of Clore and Gronenborn [50–52] used a simulated annealing (SA) algorithm in order to find solutions for multiple variable systems. SA is inspired by a metallurgical process in which the system is heated to a very high temperature and then allowed to cool down slowly. The simulation of this process could allow the atoms of a molecule to assume a low-energy configuration [35].

Standard molecular dynamics simulation force fields are built in order to reproduce the behavior of a molecular system in thermal equilibrium (constant temperature). High-energy barriers, such as those associated with cis/trans isomerization and steric hindrance, cannot be crossed using these force fields. In standard molecular dynamics simulations, the calculated structures do not change much from their initial conformation, or they get stuck at a local minimum. In order to partially solve the problem of sampling the conformational space given by the experimental restraints, a set of simplifications was proposed.

The first simplification consists in assigning the same mass to every atom (typically 100). This avoids high-frequency bond and angle vibrations, enabling a significant reduction in the number of thermalization steps. If the thermalization is too fast, with a reduced number of integration steps, then high-frequency vibrations, which affect mostly low atomic weight atoms such as hydrogens, can generate strong forces that could break covalent bonds. This simplification is especially important in the SA protocol, where the bath temperature increases up to 2,000 K and the thermalization is essential for the success of the process.

Another simplification is to turn off the attractive nonbonded interactions during the hot phase of SA. The Coulomb term is turned off and the van der Waals potential is replaced by the simplified term (REPEL) [51]:

$${F}_{\mathrm{REPEL}} = \begin{cases} 0, & \mathrm{if}\ \ r \geq s\,{r}_{\min } \\ {k}_{\mathrm{rep}}({s}^{2}{r}_{\min }^{2} - {r}^{2}), & \mathrm{if}\ \ r < s\,{r}_{\min }, \end{cases}$$

where the values of \({r}_{\min }\) are the standard values for van der Waals radii (defined in the force fields) [6]. The scale factor s is set to 1.0 in the hot phase and to 0.825 in the cooling phases. In REPEL, only the repulsive term of the Lennard-Jones potential is maintained, reducing in this way the computational cost. This makes it possible to cross high-energy barriers, which are due, in many situations, to the attractive forces imposed by Coulomb and VdW interactions, and thus aids the sampling of the conformational space.
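The REPEL term is easy to evaluate directly. The following Python sketch implements the expression exactly as written above; the function name and the numerical values are hypothetical.

```python
def repel(r, r_min, s=1.0, k_rep=1.0):
    """Simplified repulsive term: nonzero only when r < s * r_min."""
    cutoff = s * r_min
    if r >= cutoff:
        return 0.0
    return k_rep * (cutoff ** 2 - r ** 2)


# Two atoms with r_min = 2.0 A at a separation of 1.8 A.
print(repel(1.8, 2.0, s=1.0))    # inside the full radius: repulsion is active
print(repel(1.8, 2.0, s=0.825))  # inside the scaled radius? no, so no penalty
```

Reducing s shrinks the effective radii, so that atoms separated by less than \({r}_{\min }\) but more than \(s\,{r}_{\min }\) no longer repel each other.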

Additionally, the force field is modified by increasing the penalty for bond and angle geometry violations. Finally, the distance restraint quadratic potential [Eq. (18.3)] is replaced by a simplified linear term, where the penalties increase linearly with the distance restraint violation. It was shown that this modification allows the geometry of the molecule to be corrected faster.

During SA, the weight of the force field parameters is adjusted to favor the conformational sampling. A typical sequence of events in an SA protocol is shown in Fig. 18.10, where the distance restraint potential (NOE) is weighted high during all phases, while the Coulomb term is turned off.

Fig. 18.10 Illustration of a typical Cartesian space SA protocol used for protein structure calculation. Scheduled changes in the parameter values are plotted as a function of the time steps. The bath temperature is represented as the solid grey line, the dihedral angle potential as a dotted grey line, the distance restraint potentials as solid black lines, and the VdW potential as dashed black lines [7, 8]

This new method was included in the structure calculation program XPLOR [8], where a hybrid approach to distance geometry was implemented: both target function minimization and simulated annealing are used in the structure calculation.

The starting structure is calculated using the distance matrix distance geometry algorithm [73]. Subsequently, target function minimization is performed and finally a series of cycles of simulated annealing calculation are executed. It is common to compute hundreds of structures. However, only the 20 lowest-energy structures are selected to represent the protein.

9.2 SA in Torsion Angle Space

Molecular dynamics simulations (as well as SA) in Cartesian space use Newtonian mechanics at discrete time steps in order to describe the protein motion [Eqs. (18.1) and (18.2)]. Newton's equations yield the equations of motion of a system from the knowledge of all external forces acting on it.

Another way to approach molecular mechanics is to solve the Lagrange equations. Lagrangian mechanics uses scalar equations, which avoids the need to describe all the external forces that act on the system in a vectorial formalism. The Lagrangian function is defined as the difference between the kinetic and the potential energy:

$$L = T - V,$$

where T is the kinetic energy and V is the potential energy of the system. The equations of motion are obtained from the Lagrangian function by the following differential equation:

$$\frac{\mathrm{d}} {\mathrm{d}t}\left ( \frac{\partial L} {\partial \dot{{q}}_{i}}\right ) - \frac{\partial L} {\partial {q}_{i}} = 0,$$
(18.6)

where the \({q}_{i}\)'s represent the coordinates of the system and the \(\dot{{q}}_{i}\)'s are their time derivatives (velocities). Note that this equation is scalar, not vectorial.

In order to illustrate Lagrangian mechanics, we consider a simple linear spring-mass system on a frictionless table. The Lagrangian function becomes

$$L = T - V = \frac{1} {2}m\dot{{x}}^{2} -\frac{1} {2}k{x}^{2},$$

where m is the mass, x is the linear coordinate, and k is the spring constant. For this conservative system, Eq. (18.6) becomes

$$\frac{\mathrm{d}} {\mathrm{d}t}(m\dot{x}) + kx = 0\quad \Rightarrow \quad m\ddot{x} + kx = 0.$$

Note that the differentiation leads to the equation of motion of the system in the same form as in the Newtonian formalism of classical mechanics, but without the need to work out all the external vector forces acting on the system.
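The spring-mass equation of motion can also be integrated numerically at discrete time steps, which is what a dynamics program does for much larger systems. The sketch below is only an illustration: the velocity Verlet scheme, the function names, and all parameter values are assumptions, not the integrator used by any particular NMR package.

```python
import math


def simulate_spring(m=1.0, k=4.0, x0=1.0, v0=0.0, dt=1e-3, steps=1000):
    """Integrate m*x'' + k*x = 0 with velocity Verlet; return final (x, v)."""
    x, v = x0, v0
    a = -k * x / m                      # acceleration from the equation of motion
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt
        a_new = -k * x / m              # acceleration at the updated position
        v += 0.5 * (a + a_new) * dt
        a = a_new
    return x, v


def analytic_spring(m, k, x0, v0, t):
    """Closed-form solution of the same equation, used as a reference."""
    omega = math.sqrt(k / m)
    return x0 * math.cos(omega * t) + (v0 / omega) * math.sin(omega * t)


x_num, v_num = simulate_spring()
x_ref = analytic_spring(1.0, 4.0, 1.0, 0.0, 1000 * 1e-3)
# Total energy T + V should stay close to its initial value (1/2)*k*x0**2.
energy = 0.5 * 1.0 * v_num ** 2 + 0.5 * 4.0 * x_num ** 2
```

Comparing `x_num` with `x_ref` shows how closely the discrete integration follows the exact motion for a given time step.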

The same can be done for simulating the motions of a protein. The great advantage is that we can compute the positions and the movements (accelerations) of the atoms in a simplified coordinate system, in which the only variables are the torsion angles of the protein. The number of degrees of freedom is decreased about tenfold, because the geometrical parameters, such as bond lengths, bond angles, and improper dihedrals (chirality and planarity), are fixed to their optimal values during the simulation.

As discussed before, what makes the search of conformational space by methods such as SA difficult is the rough energy landscape of a protein. There are many local minima to be avoided by computational methods. The strategies to reach the global minimum and avoid kinetic traps demand high computational time and special algorithms.

In Cartesian SA, much of the computational time is spent on calculations of geometrical parameters that hardly change. The deviations from optimal geometry of bond lengths, bond angles, chirality, and planarity are small because they are parameterized to be as small as possible. In torsion angle dynamics, instead, these parameters are fixed, which reduces the number of local minima. This is the main reason why torsion angle dynamics increases the efficiency of the search of the conformational space defined by the NMR experimental restraints.

The force field used in Cartesian dynamics includes strong potentials in order to preserve the covalent structure. In torsion angle dynamics, the parameters are much simplified. One important aspect is that the time step for numerical integration in Cartesian dynamics must be very small (< 5 fs), because otherwise high-frequency bond and angle vibrations risk breaking covalent structures. In torsion angle dynamics, time steps can be three times longer because the covalent structure is fixed and such vibrations are absent.

In the implementation of torsion angle dynamics, the protein is described as a tree of rigid bodies connected by single bonds. The only degrees of freedom are rotations around the single bonds. The tree structure starts with a base, typically at the N-terminus, and ends with the “leaves”, which are the ends of the side chains and the C-terminus. The rigid bodies are labeled from 0 to n. The base is number 0 and each torsion angle is represented as \({\theta }_{k}\), where k ≥ 1. The conformation of the molecule is uniquely specified by its torsion angles \(\theta= ({\theta }_{1},{\theta }_{2},\ldots ,{\theta }_{n})\).
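To see how torsion angles alone can specify a conformation, the following sketch rebuilds Cartesian coordinates from a list of torsion angles with fixed bond lengths and bond angles. It is a simplified illustration under stated assumptions: an unbranched chain of point atoms stands in for the tree of rigid bodies, the bond length (1.5) and bond angle (111°) are illustrative values, and the helper names are hypothetical.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _unit(u):
    norm = math.sqrt(_dot(u, u))
    return tuple(a / norm for a in u)


def place_atom(a, b, c, bond, angle, torsion):
    """Place atom d so that |cd| = bond, angle(b, c, d) = angle and
    dihedral(a, b, c, d) = torsion (angles in radians)."""
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))   # normal to the plane a-b-c
    m = _cross(n, bc)                   # completes the local frame
    dx = -bond * math.cos(angle)
    dy = bond * math.sin(angle) * math.cos(torsion)
    dz = bond * math.sin(angle) * math.sin(torsion)
    return tuple(c[i] + dx * bc[i] + dy * m[i] + dz * n[i] for i in range(3))


def dihedral(a, b, c, d):
    """Signed dihedral angle a-b-c-d in radians."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    return math.atan2(math.sqrt(_dot(b2, b2)) * _dot(b1, n2), _dot(n1, n2))


def build_chain(torsions, bond=1.5, angle=math.radians(111.0)):
    """Coordinates of a chain whose only free parameters are the torsions."""
    atoms = [(0.0, 0.0, 0.0),
             (bond, 0.0, 0.0),
             (bond - bond * math.cos(angle), bond * math.sin(angle), 0.0)]
    for t in torsions:
        atoms.append(place_atom(*atoms[-3:], bond, angle, t))
    return atoms


chain = build_chain([math.radians(60.0), math.radians(-120.0), math.radians(180.0)])
```

Here six atoms (18 Cartesian coordinates) are described by three torsion angles, which illustrates the reduction in the number of variables discussed above.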

The potential energy is defined as

$$V = \left \{\begin{array}{l@{\quad }l} 0 \quad &\mathrm{if\ distances\ and\ angles\ are\ within\ the\ bounds\ and\ atoms\ are} \\ \quad &\mathrm{not\ overlapped} \\ {V }_{\mathrm{target}}\quad &\mathrm{otherwise},\\ \quad \end{array} \right .$$

where \({V }_{\mathrm{target}}\) is the target function, which depends on the upper and lower bounds for the distances and on the angular restraints, and \({\omega }_{0}\) is a weighting factor. Note that V > 0 if the experimental bounds are not satisfied or atoms overlap. Motion occurs when V > 0. The kinetic energy and the inertia tensor are calculated recursively at each time step of numerical integration. For details on the algorithms, see [22, 24].

The Lagrange equation takes the form

$$\frac{\mathrm{d}} {\mathrm{d}t}\left ( \frac{\partial L} {\partial \dot{{\theta }}_{i}}\right ) - \frac{\partial L} {\partial {\theta }_{i}} = 0.$$

The differentiation leads to equations of motion of the form:

$$M(\theta )\ddot{\theta } + C(\theta ,\dot{\theta }) = 0,$$
(18.7)

where M(θ) is the mass matrix and \(C(\theta ,\dot{\theta })\) is an n-dimensional vector that depends on the torsion angles and their velocities. Note that Eq. (18.7) is obtained by the same mathematical procedure that was applied to the linear spring-mass system on a frictionless table [Eq. (18.6)]. For a detailed description, see [22].
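As a minimal worked case (an assumed special case for illustration, not taken from [22]), consider a single rigid body with constant moment of inertia I rotating about one fixed bond, with torsion angle θ and potential energy V(θ). The Lagrange equation then gives

$$L = \frac{1} {2}I\dot{{\theta }}^{2} - V (\theta ),\qquad \frac{\mathrm{d}} {\mathrm{d}t}(I\dot{\theta }) + \frac{\partial V } {\partial \theta } = 0\quad \Rightarrow \quad I\ddot{\theta } + \frac{\partial V } {\partial \theta } = 0,$$

which has the form of Eq. (18.7) with M(θ) = I and \(C = \partial V/\partial \theta \). In the general case with many coupled rigid bodies, the mass matrix depends on θ and the vector C also collects velocity-dependent terms.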

Torsion angle space SA is efficient for searching conformational space because it smooths the protein energy landscape, which helps to avoid local minima. It also enables the hot phase of SA to be run at very high temperatures, such as 50,000 K. However, it is a statistical method, and there is no mathematical proof that the global minimum will actually be found.

The introduction of the torsion angle space SA solved the problem of searching the conformational space given by NMR experimental restraints. It is the most frequently used method, and it is implemented in all programs developed for structural determinations, such as XPLOR-NIH, CNS, and CYANA. The algorithm is very efficient and enables the calculation of a protein structure in minutes.

10 Future Demands for Protein Structure Determination

This chapter has shown the evolution of computational methods for protein structural determination using NMR experimental data. It is clear that structural determination by NMR does not rely on direct spatial data but on a set of short-range experimental distance and angle restraints that, combined with some structural geometrical information on proteins, can be exploited for producing structural models. Over the years the NMR structures determined by the methods discussed in this chapter have been accepted by the scientific community as realistic and useful for studying biochemical mechanistic problems.

The torsion angle space SA protocols can be very efficient for searching the conformational space under the constraints given by NMR experiments. All semiautomated methods for structural determination, such as ARIA [62] and UNIO [17, 19, 28, 72], make use of torsion angle space SA. It is also implemented in the software tools for structure determination by NMR, such as XPLOR-NIH [64], CNS [7, 8], and CYANA [23, 24, 28, 47].

Although distance geometry combined with simulated annealing (DGSA) is not the most usual method, it offers many advantages: (1) distance geometry is not a statistical method and can offer mathematical proof that the global minimum has been achieved; (2) DGSA is as fast and efficient in the search of conformational space as torsion angle space simulated annealing; (3) since DGSA relies on a geometrical method for the search of conformational space, it can be used for large proteins and complexes. Statistical methods, on the other hand, can become inefficient when the size of the protein is large.

The increase in protein size also limits the number of restraints that are available. NMR spectroscopy can nowadays generate structural information for large proteins and protein complexes. However, such structural information is sparse, and new methods for structural calculation with sparse data are becoming increasingly important.

Standley et al. [69] proposed in 1999 a branch-and-bound algorithm for protein structure refinement with sparse data. They used distance geometry methods to minimize an error function which is based on the experimental restraints, as well as a residue-based protein folding potential. This algorithm is able to identify more compact structures. The protein folding term is based on the idea of using long-range potentials so that the dependence on long-range distance restraints is reduced.

Dong and Wu [16] in 2003 introduced a geometrical method for solving NMR structures with sparse data. In general, NMR spectroscopy generates experimental data that are not complete. They used geometrical information, in a way similar to how bound smoothing and metrization use triangle and tetrangle inequalities, to build up the “missing” information. The algorithm calculates the coordinates of a given atom on the basis of the coordinates of the previously computed atoms and of the distances between the current and the previous atoms. Some assumptions need to be satisfied in order to use this algorithm. Davis et al. [14] proposed an improved algorithm, called revised updated geometric build-up algorithm (RUGB), to build up missing information.
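The core step described above, computing the coordinates of one atom from previously placed atoms and the distances to them, can be sketched as a small linear system. This is an illustration under assumptions and not the algorithm of [16] or [14]: it assumes exact distances to four already placed, non-coplanar atoms, and the function names are hypothetical.

```python
import math


def _det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def locate_atom(points, dists):
    """Position x with |x - points[i]| = dists[i] for four known atoms.

    Subtracting the first sphere equation from the other three removes the
    quadratic term and leaves the linear system
        2 (p_i - p_0) . x = |p_i|^2 - |p_0|^2 - d_i^2 + d_0^2,  i = 1, 2, 3,
    which is solved here with Cramer's rule.
    """
    p0, d0 = points[0], dists[0]
    sq = lambda p: sum(c * c for c in p)
    rows, rhs = [], []
    for p, d in zip(points[1:4], dists[1:4]):
        rows.append([2.0 * (p[i] - p0[i]) for i in range(3)])
        rhs.append(sq(p) - sq(p0) - d * d + d0 * d0)
    det = _det3(rows)
    if abs(det) < 1e-12:
        raise ValueError("the four reference atoms are (nearly) coplanar")
    x = []
    for col in range(3):
        replaced = [row[:] for row in rows]
        for r in range(3):
            replaced[r][col] = rhs[r]   # replace one column by the right-hand side
        x.append(_det3(replaced) / det)
    return tuple(x)


# Four placed atoms and the distances from them to an "unknown" atom.
placed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
hidden = (0.3, -0.2, 0.7)
recovered = locate_atom(placed, [math.dist(p, hidden) for p in placed])
```

Repeating this step atom by atom builds up the structure, provided that each new atom has enough known distances to atoms that are already placed.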

Liberti et al. [41] proposed the use of a discrete search occurring in continuous space for solving protein structures. The main idea is to use distance information between atoms that are contiguous (sequential) in order to discretize the search space (which has the structure of a tree), and to employ a branch-and-prune algorithm for solving the discretized problem. In the branch-and-prune, new candidate atomic positions are generated at each iteration (branching), and their feasibility is verified immediately so that branches of the tree which do not contain solutions can be removed (pruning). The branch-and-prune can work with both exact and interval data [39] and also under the hypothesis that only distances between hydrogen atoms are available [53].
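The branching and pruning steps can be sketched for a toy instance. This is a simplified illustration and not the implementation of [41]: it assumes exact distances, an atom order in which every atom has known distances to its three immediate predecessors, and no three consecutive atoms on a line; all function names are hypothetical.

```python
import math


def _sub(u, v):
    return [a - b for a, b in zip(u, v)]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def sphere_intersection(p1, r1, p2, r2, p3, r3, tol=1e-9):
    """Points at distances r1, r2, r3 from p1, p2, p3 (two, or none)."""
    d = math.sqrt(_dot(_sub(p2, p1), _sub(p2, p1)))
    ex = [c / d for c in _sub(p2, p1)]
    v = _sub(p3, p1)
    i = _dot(ex, v)
    ey_raw = [v[k] - i * ex[k] for k in range(3)]
    j = math.sqrt(_dot(ey_raw, ey_raw))
    ey = [c / j for c in ey_raw]
    ez = _cross(ex, ey)
    x = (r1 * r1 - r2 * r2 + d * d) / (2 * d)
    y = (r1 * r1 - r3 * r3 + i * i + j * j) / (2 * j) - (i / j) * x
    z2 = r1 * r1 - x * x - y * y
    if z2 < -tol:
        return []                       # the three spheres do not meet
    z = math.sqrt(max(z2, 0.0))
    base = [p1[k] + x * ex[k] + y * ey[k] for k in range(3)]
    return [[base[k] + s * z * ez[k] for k in range(3)] for s in (1.0, -1.0)]


def branch_and_prune(n, dist, tol=1e-6):
    """Depth-first search; dist maps (i, j), i < j, to exact distances."""
    d01, d12, d02 = dist[(0, 1)], dist[(1, 2)], dist[(0, 2)]
    x2 = (d01 * d01 + d02 * d02 - d12 * d12) / (2 * d01)
    y2 = math.sqrt(max(d02 * d02 - x2 * x2, 0.0))
    coords = [[0.0, 0.0, 0.0], [d01, 0.0, 0.0], [x2, y2, 0.0]]

    def feasible(k, p):
        # Pruning test: any extra distance to an earlier atom must be met.
        return all(abs(math.dist(coords[i], p) - dist[(i, k)]) <= tol
                   for i in range(k - 3) if (i, k) in dist)

    def extend(k):
        if k == n:
            return True
        candidates = sphere_intersection(coords[k - 1], dist[(k - 1, k)],
                                         coords[k - 2], dist[(k - 2, k)],
                                         coords[k - 3], dist[(k - 3, k)])
        for p in candidates:            # branching: at most two positions
            if feasible(k, p):          # pruning: drop infeasible positions
                coords.append(p)
                if extend(k + 1):
                    return True
                coords.pop()            # backtrack
        return False

    return coords if extend(3) else None
```

Because distances are unchanged by rotations, translations, and reflections, the returned coordinates reproduce the input distances but need not coincide with the coordinates from which those distances were measured.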

In conclusion, structural determination using NMR experimental data needs the use of efficient computational methods. The continuous development of NMR and of computational methods can improve the quality, efficiency, and limits for structural determination by NMR.