Introduction

The structure determination of membrane proteins, amyloid fibrils, and large complexes represents one of the biggest challenges in structural biology. NMR spectroscopy is an important tool for such investigations, since it can provide both structural and dynamic information. The basis for any detailed investigation of proteins by NMR is the resonance assignment, which can take considerable time and effort and is often hindered by experimental issues, especially in larger proteins, protein complexes, membrane proteins, or amyloid fibrils. Larger molecules have longer rotational correlation times and consequently shorter transverse relaxation times T 2, leading to line broadening in the NMR spectrum. Especially in 3D and 4D spectra, even the natural linewidth given by T 2 relaxation usually cannot be reached because the limited acquisition time demands truncation of the free induction decays in the indirect dimensions. Hence the signal is convolved with a function whose width is inversely proportional to the maximum evolution time, which in general results in broader lines than relaxation alone (Ernst et al. 1987; Szántay 2007). This results in signal overlap in the spectrum. With membrane proteins the situation is further complicated because the protein has to be surrounded by amphipathic molecules that increase the effective molecular weight, resulting in broader line widths. Combined with the relatively narrow chemical shift dispersion found in α-helices, this increases overlap and leads to assignment ambiguity. Using 3D and 4D spectra can in principle reduce overlap, but, as mentioned above, at least part of the advantage is lost by truncation effects and lower sensitivity. Higher static magnetic fields B 0 can improve the resolution and signal-to-noise ratio but may not be readily available.
Another way to reduce the overlap is to sparsely label the sample (Goto and Kay 2000; Kainosho and Güntert 2009; Lian and Middleton 2001), for example by using labeled amino acids (Higman et al. 2009; Kainosho et al. 2006), segmental labeling (Busche et al. 2009; Yamazaki et al. 1998), transmembrane segment enhanced labeling (Reckel et al. 2008), or employing pair-labeling strategies (Hefke et al. 2011).

The extent of overlap is a function of the line width, the number of peaks, and their dispersion in the spectrum. The simplest model for estimating peak overlap assumes that peaks are uniformly distributed in the spectrum (Mumenthaler et al. 1997). To get a first estimate of how overlap becomes more problematic in larger proteins with more shifts and broader resonance lines, we consider N peaks, distributed randomly within a region of size Γ in an n-dimensional spectrum. Each peak is assumed to occupy a “peak region” of size γ, given by the data points of the peak that are significantly above the noise level. A peak is classified as overlapped if the center of at least one other peak falls within its peak region. The expected number of peaks that are not overlapped with other peaks can be approximated by (Kainosho et al. 2006)

$$ \tilde{N} = N\left( {1 - \gamma /\Upgamma } \right)^{N - 1} \approx Ne^{ - N\gamma /\Upgamma } . $$
(1)

The number \( \tilde{N} \) of non-overlapped peaks decreases exponentially with the size of the protein and the size of the peak region. The quantity \( N\gamma /\Upgamma \) in the exponent of Eq. 1 is the fraction of the entire spectral space that would be occupied by peaks in the absence of any overlap. If the n-dimensional peak regions γ and the spectral region Γ are the product of corresponding one-dimensional regions γ 1 and Γ1, one obtains

$$ \tilde{N} \approx Ne^{{ - N(\gamma_{1} /\Upgamma_{1} )^{n} }} . $$
(2)

With increasing number n of dimensions the space in which the peaks are distributed increases exponentially and overlap is significantly decreased. However, this simplistic model underestimates the amount of overlap that occurs in a real spectrum because it makes the unrealistic assumption that peaks are distributed uniformly and independently in the spectrum. The overlap probability expressed by Eqs. 1 and 2 is therefore overly optimistic. In reality peak positions are determined by the underlying chemical shifts, which in turn are dependent on the type of amino acid they originate from and other factors such as the secondary structure. For instance, α-helical membrane proteins exhibit narrower chemical shift distributions than β-sheet proteins (Oxenoid et al. 2004).
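The uniform model of Eqs. 1 and 2 is easy to evaluate numerically. The following minimal sketch compares the exact expression with its exponential approximation for a 2D and a 3D spectrum. The function name and the example values (200 peaks, γ 1/Γ1 = 0.05) are illustrative assumptions and are not taken from the data of this paper.

```python
import math


def expected_resolved_peaks(n_peaks, gamma1_over_Gamma1, n_dims):
    """Expected number of non-overlapped peaks in the uniform model.

    Returns (exact, approx): the left-hand expression of Eq. 1 and its
    exponential approximation, with gamma/Gamma = (gamma_1/Gamma_1)**n
    as assumed for Eq. 2.
    """
    occupied_fraction = gamma1_over_Gamma1 ** n_dims
    exact = n_peaks * (1.0 - occupied_fraction) ** (n_peaks - 1)
    approx = n_peaks * math.exp(-n_peaks * occupied_fraction)
    return exact, approx


# Hypothetical example: 200 peaks, each covering 5 % of every axis.
for n_dims in (2, 3):
    exact, approx = expected_resolved_peaks(200, 0.05, n_dims)
    print(f"{n_dims}D: exact {exact:.1f}, approximation {approx:.1f}")
```

Under these assumed values the third dimension raises the expected number of resolved peaks substantially, which is the trend expressed by Eq. 2.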

In this article we present a new method to estimate the expected spectral overlap that assigns to peaks in a spectrum a probability of being overlapped by using atom-specific chemical shift distributions from the Biological Magnetic Resonance Data Bank (BMRB) (Ulrich et al. 2008).

Materials and methods

The new method for estimating peak overlap has been implemented in the CYANA software package (Güntert 2009; Güntert et al. 1997). An overview of the algorithm is given in Fig. 1.

Fig. 1
Flowchart of the overlap prediction procedure

Chemical shift database

To account for the different chemical shift distributions of individual atoms, shifts are not treated as uniformly distributed over the entire NMR spectrum. Instead, the chemical shift of atom k is assumed to be distributed normally with mean ω k and standard deviation σ k :

$$ \mu_{N} \left( {x; \omega_{k} ,\sigma_{k} } \right) = \frac{1}{{\sqrt {2\pi } \,\sigma_{k} }}e^{{ - \frac{1}{2}\left( {\frac{{x - \omega_{k} }}{{\sigma_{k} }}} \right)^{2} }} . $$
(3)

The mean value ω k and standard deviation σ k are obtained from the shift statistics of the BMRB database that stores for every atom the mean position, standard deviation, and number of occurrences in all protein data sets in the database. The distributions μ N of Eq. 3 are only reliable if they are based on a sufficient number of chemical shift values. By default, at least 100 measured values were required. Two separate normal distributions are used for the oxidized and reduced forms of cysteine, which the user distinguishes in the protein amino acid sequence by using the residue codes CYSS and CYS for oxidized and reduced cysteine, respectively. Other cases, such as cis/trans proline, can be handled similarly. The statistics can also be obtained from other sources than the BMRB. For instance, if shifts exist for homologous or otherwise similar proteins, the database can be tailored to a certain class of proteins. Any given set of chemical shift lists and sequences for proteins can be readily processed into a CYANA library with corresponding chemical shift statistics. For the calculations of this paper, we used the general chemical shift statistics from the BMRB database.
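The construction of such statistics from a set of assigned shifts can be sketched as follows. This is only an illustration and not the CYANA library format: the function name, the tuple layout of the input, and the returned dictionary are assumptions made for this example. Only the default threshold of 100 values is taken from the text.

```python
import statistics
from collections import defaultdict

MIN_VALUES = 100  # default minimum number of measured shifts per atom


def build_shift_statistics(observations, min_values=MIN_VALUES):
    """Collect per-atom shift statistics.

    observations: iterable of (residue_type, atom_name, shift_ppm).
    Returns {(residue_type, atom_name): (mean, std, count)} for all atoms
    with at least min_values observations. Oxidized and reduced cysteine
    stay separate if they are given distinct residue codes (CYSS, CYS).
    """
    grouped = defaultdict(list)
    for residue_type, atom_name, shift in observations:
        grouped[(residue_type, atom_name)].append(shift)

    stats = {}
    for key, shifts in grouped.items():
        # A standard deviation needs at least two values.
        if len(shifts) >= max(min_values, 2):
            stats[key] = (
                statistics.mean(shifts),
                statistics.stdev(shifts),
                len(shifts),
            )
    return stats
```

Atoms that do not reach the threshold are simply absent from the result, so that no normal distribution is assumed for them.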

Expected peaks

The algorithm estimates overlap probabilities using lists of peaks that are expected based on experiment type-specific magnetization transfer pathways and the covalent structure of the protein (Bartels et al. 1997; Schmidt and Güntert 2012; Schmucki et al. 2009). The magnetization transfer pathways for a spectrum are given as connectivity patterns stored in the CYANA library. For instance, the HNCA spectrum can be described by the magnetization transfer pathways for its intra- and interresidual peaks:

$$ \begin{gathered} {\text{SPECTRUM}}\;\,{\text{HNCA}}\quad {\text{HN}}\;\,{\text{N}}\;\,{\text{C}} \hfill \\ 0.98\quad {\text{HN}}:{\text{H}}\_{\text{AMI}}\quad {\text{N}}:{\text{N}}\_{\text{AMI}}\quad {\text{C}}:{\text{C}}\_{\text{ALI}}\quad {\text{C}}\_{\text{BYL}} \hfill \\ 0.80\quad {\text{HN}}:{\text{H}}\_{\text{AMI}}\quad {\text{N}}:{\text{N}}\_{\text{AMI}}\quad {\text{C}}\_{\text{BYL}}\quad {\text{C}}:{\text{C}}\_{\text{ALI}}\quad {\text{N}}\_{\text{AMI}} \hfill \\ \end{gathered} $$

The first line gives the spectrum name and the atom labels that will be used to identify the respective columns in the peak lists. The number of atom labels defines the dimensionality n of the spectrum. Each of the following lines specifies a (formal) magnetization transfer pathway, characterized by the probability of the resulting expected peak (not used by the present algorithm) followed by a series of atom types (H_AMI, amide hydrogen; N_AMI, amide nitrogen; C_ALI, aliphatic carbon; C_BYL, carbonyl carbon; etc., as used in the CYANA residue library) that define a molecular pattern of atoms linked by direct covalent bonds. In each pathway the n atoms whose shifts will determine the position of the resulting peak are identified by their corresponding atom labels, followed by ‘:’. Note that in the case of the HNCA spectrum, the pathways include a “detour” through the carbonyl carbon (C_BYL) to exclude peaks originating from Hε–Nε–Cδ in Arg and Hζ–Nζ–Cε in Lys. Through-space type experiments are approximated by the subset of short-range peaks using an extended set of magnetization pathways, which is accurate enough for the present purpose. The magnetization pathway library can be adapted and extended easily.

The peak lists generated by expected peak prediction are “perfect” and contain in general more peaks than can be identified in a real spectrum. Expected peaks are generated only for atoms with, by default, at least 100 shift values in the BMRB database. Groups of atoms with degenerate chemical shifts, e.g. methyl groups, are represented by a single shift value.
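For the HNCA example, the outcome of expected peak generation can be illustrated with a strongly simplified sketch. It does not perform the general pattern matching on the covalent structure that the library-based approach uses. Instead it hard-codes the two HNCA pathways, and the function name, the atom names H, N and CA, and the skipping of the N-terminal residue are assumptions made for this illustration only.

```python
def expected_hnca_peaks(sequence):
    """Simplified expected HNCA peaks for a list of three-letter residue codes.

    Each peak is a tuple of three (residue_index, atom_name) entries for the
    HN, N and C dimensions. Assumptions: proline has no amide proton, and the
    N-terminal residue is skipped because it has no preceding residue.
    """
    peaks = []
    for i, residue in enumerate(sequence):
        if i == 0 or residue == "PRO":
            continue
        peaks.append(((i, "H"), (i, "N"), (i, "CA")))      # intraresidual
        peaks.append(((i, "H"), (i, "N"), (i - 1, "CA")))  # interresidual
    return peaks


# Hypothetical four-residue fragment.
for peak in expected_hnca_peaks(["MET", "ALA", "PRO", "GLY"]):
    print(peak)
```

Identifying each peak by its atoms rather than by shift values makes it possible to recognize later which dimensions of two peaks belong to the same atom.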

Definition of overlap between two peaks

For the purpose of overlap prediction a peak is considered overlapped if it cannot be resolved from other peaks in n-dimensional space. The most straightforward implementation of this criterion would classify a peak as overlapped if the center of at least one other peak falls within its peak region, and define the peak region by hard cutoffs for each spectral dimension. However, in order to simplify the theory, we define the ability to distinguish peaks by a Gaussian function of the peak position difference rather than by a fixed “hard” cutoff for this difference, because this allows the derivation of analytic expressions for the overlap probabilities (see below). Since the distance between two peaks is not the only factor that decides whether they can be distinguished or not (others include the relative peak intensities, local noise level, peak shape, etc.), the “soft” approach is as sensible as a hard cutoff, and very similar results are expected for both approaches (see “Results and discussion” below). The probability that two peaks cannot be distinguished from each other in one dimension of a spectrum is defined to be

$$ p\left( {\Updelta x} \right) = e^{{ - \frac{1}{2}\left( {\frac{\Updelta x}{\delta }} \right)^{2} }} , $$
(4)

where Δx is the difference of the peak positions, and the overlap tolerance δ is a parameter that can be set by the user according to the expected resolution of the spectrum. Equation 4 expresses in a “soft” way the idea that two peaks cannot be distinguished if they are exactly overlapped (p(0) = 1), that distinction is difficult for |Δx| < δ, and clear for |Δx| ≫ δ.

In principle, the overlap tolerance is related to both the collected and processed digital resolution of the spectra and the relaxation times of the involved nuclei. For convenience, because peak positions and chemical shift values are given in ppm, the overlap tolerance δ is specified in ppm, even though, strictly speaking, it should be expressed in Hz, which is the proper unit for both relaxation and signal truncation linewidth. The default values of the overlap tolerance are δ H = 0.03 ppm for 1H dimensions and δ N = δ C = 0.3 ppm for 15N and 13C dimensions. The overlap tolerance could be set according to the chemical shift error values in the chemical shift data files of the BMRB database (Ulrich et al. 2008). However, since different ways of setting the chemical shift error values in the BMRB appear to be in use for different proteins, and because we did not have access to the original spectra, we chose to use these default values uniformly for all calculations in this paper. In practice, the overlap tolerances should be set by visually inspecting the spectra and choosing δ based on the smallest separation between neighboring, distinguishable peaks. In addition, the choice of δ may depend on how the spectra will be used: If it is sufficient to detect the presence of a peak, e.g. for resonance assignment, a smaller overlap tolerance is acceptable than for applications that require accurate peak intensities, e.g. NOESY spectra for the collection of conformational restraints, and even larger overlap tolerances are advisable if the peak shape or peak fine structure are to be analyzed, e.g. for determining scalar coupling constants (Szyperski et al. 1992).

The overlap definition of Eq. 4 can be related to a more traditional overlap definition using a hard cutoff δ′ to define overlap when |Δx| < δ′. The corresponding probability is \( p^{\prime } \left( {\Updelta x} \right) = \theta (\delta^{\prime } - \left| {\Updelta x} \right|), \) where θ is the Heaviside step function that equals one for positive arguments and zero otherwise. Equating the integrals over the two probabilities, \( \int_{ - \infty }^{\infty } {p\left( x \right)dx} = \int_{ - \infty }^{\infty } {p^{\prime } \left( x \right)dx} , \) i.e. \( \sqrt {2\pi } \,\delta = 2\delta^{\prime } , \) yields the relationship \( \delta^{\prime } = \sqrt {\pi /2} \,\delta \approx 1.25\,\delta \) between the two overlap tolerance parameters. This means that the expected overlap computed on the basis of Eq. 4 will be approximately equivalent to the expected overlap computed with a 25 % larger hard cutoff.
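The relation between the soft and the hard tolerance can be checked numerically. The sketch below integrates the Gaussian of Eq. 4 with a simple midpoint rule and compares the area with 2δ′, the area of the step function. All function names are illustrative, and the integration range of ±10δ is an assumption that is wide enough for the Gaussian tails to be negligible.

```python
import math


def soft_overlap(dx, delta):
    """Eq. 4: probability that two peaks separated by dx are not resolved."""
    return math.exp(-0.5 * (dx / delta) ** 2)


def equivalent_hard_cutoff(delta):
    """Hard cutoff delta' whose step function has the same area as Eq. 4."""
    return math.sqrt(math.pi / 2.0) * delta


def integrate(f, lower, upper, steps=20_000):
    """Midpoint-rule integration of f over [lower, upper]."""
    h = (upper - lower) / steps
    return h * sum(f(lower + (k + 0.5) * h) for k in range(steps))


delta = 0.03  # default 1H tolerance in ppm
area_soft = integrate(lambda dx: soft_overlap(dx, delta), -10 * delta, 10 * delta)
area_hard = 2.0 * equivalent_hard_cutoff(delta)
print(area_soft, area_hard)
```

The two printed areas should agree to within the numerical accuracy of the integration.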

Overlap probability for two peaks in one dimension

The overlap definition of Eq. 4 suffices to calculate the number of overlapped peaks in a peak list in which all peak positions are fixed. However, given only the sequence of the protein, estimating the overlap for a list of expected peaks whose position is not yet known requires integration over the chemical shift distributions that describe the expected peak positions, as will be described in the following.

We consider two atoms with chemical shifts that are not known precisely. We assume that they follow normal distributions μ N according to Eq. 3 with mean values ω 1 and ω 2, and standard deviations σ 1 and σ 2, respectively. The probability P deg that two corresponding peaks in a one-dimensional spectrum, or in one dimension of a multidimensional spectrum, cannot be distinguished is

$$ \begin{aligned} P_{\rm deg } \left( {\omega_{1} ,\sigma_{1} ,\omega_{2} ,\sigma_{2} ,\delta } \right) & = \mathop \int \limits_{ - \infty }^{\infty } dx_{1} \mu_{N} (x_{1} ;\omega_{1} ,\sigma_{1} ) \mathop \int \limits_{ - \infty }^{\infty } dx_{2} \mu_{N} \left( {x_{2} ;\omega_{2} ,\sigma_{2} } \right)p\left( {x_{1} - x_{2} } \right) \\ & = \frac{\delta }{{\sqrt {\sigma_{1}^{2} + \sigma_{2}^{2} + \delta^{2} } }}\exp \left( { - \frac{1}{2}\left( {\frac{{\omega_{1} - \omega_{2} }}{{\sqrt {\sigma_{1}^{2} + \sigma_{2}^{2} + \delta^{2} } }}} \right)^{2} } \right) \\ & = \sqrt {2\pi } \delta \mu_{N} \left( {\omega_{1} - \omega_{2} ;0, \sqrt {\sigma_{1}^{2} + \sigma_{2}^{2} + \delta^{2} } } \right), \\ \end{aligned} $$
(5)

where \( p\left( {x_{1} - x_{2} } \right) \) is the probability of Eq. 4 that two signals with chemical shift difference \( x_{1} - x_{2} \) cannot be distinguished, and it is assumed that the distributions of the two chemical shifts are independent. Substituting \( \Updelta \omega = \omega_{1} - \omega_{2} \) for the difference between the mean values of the two atom chemical shifts and \( \sigma = \sqrt {\sigma_{1}^{2} + \sigma_{2}^{2} } \) for the root sum of squares of their standard deviations, i.e. the standard deviation of the shift difference, Eq. 5 becomes

$$ P_{\rm deg } \left( {\Updelta \omega , \sigma ,\delta } \right) = \frac{\delta }{{\sqrt {\sigma^{2} + \delta^{2} } }}\exp \left( { - \frac{1}{2}\left( {\frac{\Updelta \omega }{{\sqrt {\sigma^{2} + \delta^{2} } }}} \right)^{2} } \right) = \sqrt {2\pi } \delta \mu_{N} (\Updelta \omega ;0, \sqrt {\sigma^{2} + \delta^{2} } ), $$
(6)

For \( \sigma \gg \delta , \) as is usually the case, this further simplifies to

$$ P_{\rm deg } \left( {\Updelta \omega , \sigma ,\delta } \right) \approx \frac{\delta }{\sigma }\exp \left( { - \frac{1}{2}\left( {\frac{\Updelta \omega }{\sigma }} \right)^{2} } \right) = \sqrt {2\pi } \delta \,\mu_{N} (\Updelta \omega ;0, \sigma ). $$
(7)

The overlap probability P deg of Eq. 6 can be expressed as a function of only two variables, the dimensionless quantities Δω/δ and σ/δ,

$$ P_{\rm deg } \left( {\Updelta \omega /\delta , \sigma /\delta } \right) = \frac{1}{{\sqrt {1 + \left( {\sigma /\delta } \right)^{2} } }}\exp \left( { - \frac{1}{2}\left( {\frac{\Updelta \omega /\delta }{{\sqrt {1 + \left( {\sigma /\delta } \right)^{2} } }}} \right)^{2} } \right) = \sqrt {2\pi }\,\mu_{N} (\Updelta \omega /\delta ;0,\sqrt {1 + \left( {\sigma /\delta } \right)^{2} } ). $$
(8)

P deg decays as a Gaussian function of Δω, but decreases only slowly with increasing σ (Fig. 2).

Fig. 2
Overlap probability \( P_{\rm deg } \left( {\Updelta \omega /\delta , \sigma /\delta } \right) \) of Eq. 8 for two peaks in one dimension. The chemical shifts of the two atoms are assumed to be normally distributed with mean values ω 1 and ω 2, and standard deviations σ 1 and σ 2, respectively; \( \Updelta \omega = \omega_{1} - \omega_{2} \) is the difference between the mean values of the two atom chemical shifts, \( \sigma = \sqrt {\sigma_{1}^{2} + \sigma_{2}^{2} } \) is the root sum of squares of their standard deviations, and δ is the overlap tolerance parameter introduced in Eq. 4

The approach can also be used if the position of one of the two peaks is already known by setting the corresponding standard deviation to zero, e.g. σ 1 = 0. Equation 8 remains valid with σ = σ 2. If the positions of both peaks are fixed, \( \sigma = \sigma_{1} = \sigma_{2} = 0, \) and Eq. 8 reduces to Eq. 4.
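The closed form of Eq. 5 translates directly into code. The following minimal sketch (the function name and the numerical values are illustrative assumptions) also shows the limiting case of two fixed peak positions, in which the expression reduces to Eq. 4.

```python
import math


def p_deg_1d(omega1, sigma1, omega2, sigma2, delta):
    """Eq. 5: overlap probability of two peaks in one dimension.

    omega1, omega2: mean chemical shifts; sigma1, sigma2: their standard
    deviations (0 for a known position); delta: overlap tolerance.
    All quantities are in ppm.
    """
    total_variance = sigma1 ** 2 + sigma2 ** 2 + delta ** 2
    prefactor = delta / math.sqrt(total_variance)
    return prefactor * math.exp(-0.5 * (omega1 - omega2) ** 2 / total_variance)


# Two amide protons with hypothetical statistics, default 1H tolerance.
print(p_deg_1d(8.2, 0.6, 8.4, 0.6, 0.03))

# Both positions known (sigma = 0): the value of Eq. 4 at |dx| = delta.
print(p_deg_1d(8.00, 0.0, 8.03, 0.0, 0.03), math.exp(-0.5))
```

Setting only one of the two standard deviations to zero covers the case in which one peak position is already known.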

Overlap probability for two peaks in n dimensions

Two peaks cannot be distinguished in an n-dimensional spectrum if they overlap in each of the n dimensions. Their overlap probability therefore becomes:

$$ P_{\rm deg }^{\left( n \right)} \left( {\frac{{\Updelta \omega^{\left( 1 \right)} }}{{\delta^{\left( 1 \right)} }}, \ldots ,\frac{{\Updelta \omega^{\left( n \right)} }}{{\delta^{\left( n \right)} }};\frac{{\sigma^{\left( 1 \right)} }}{{\delta^{\left( 1 \right)} }}, \ldots ,\frac{{\sigma^{\left( n \right)} }}{{\delta^{\left( n \right)} }}} \right) = \mathop \prod \limits_{i = 1}^{\prime n} P_{\rm deg } \left( {\frac{{\Updelta \omega^{\left( i \right)} }}{{\delta^{\left( i \right)} }},\frac{{\sigma^{\left( i \right)} }}{{\delta^{\left( i \right)} }}} \right) $$

To account for the fact that peaks assigned to the same atom must be aligned in the corresponding dimension, these peaks are considered as fully overlapped in the respective dimension and the corresponding P deg term is omitted from the product.
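A minimal sketch of the n-dimensional product, including the omission of dimensions in which both peaks arise from the same atom, could look as follows. The representation of a peak as a tuple of (atom identifier, mean shift, standard deviation) entries, the atom identifiers, and all numerical values are assumptions made for this illustration.

```python
import math


def p_deg_1d(d_omega, sigma, delta):
    """Eq. 6: one-dimensional overlap probability."""
    total_variance = sigma ** 2 + delta ** 2
    return delta / math.sqrt(total_variance) * math.exp(-0.5 * d_omega ** 2 / total_variance)


def p_deg_nd(peak_a, peak_b, deltas):
    """Overlap probability of two peaks in n dimensions.

    Each peak is a sequence of (atom_id, omega, sigma), one entry per
    dimension. Dimensions in which both peaks involve the same atom are
    fully overlapped, so their factor is left out of the product.
    """
    probability = 1.0
    for (atom_a, w_a, s_a), (atom_b, w_b, s_b), delta in zip(peak_a, peak_b, deltas):
        if atom_a == atom_b:
            continue
        probability *= p_deg_1d(w_a - w_b, math.hypot(s_a, s_b), delta)
    return probability


# Hypothetical intra- and interresidual HNCA peaks of one residue: they share
# the amide H and N, so only the 13C dimension contributes.
DELTAS = (0.03, 0.3, 0.3)
intra = (("H.42", 8.2, 0.6), ("N.42", 120.0, 4.0), ("CA.42", 56.0, 2.0))
inter = (("H.42", 8.2, 0.6), ("N.42", 120.0, 4.0), ("CA.41", 53.0, 2.0))
print(p_deg_nd(intra, inter, DELTAS))
```

Without the omission rule, the shared amide dimensions would contribute factors smaller than one and the overlap between such peak pairs would be underestimated.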

Overlap probability for N peaks in n dimensions

In a data set of N peaks in an n-dimensional spectrum, the probability \( P_{\rm deg }^{{({\text{tot}})}} \) that a given peak j overlaps with one or more other peaks is the complement of the probability that it does not overlap with any other peak:

$$ P_{\rm deg }^{{\left( {\text{tot}} \right)}} (j) = 1 - \mathop \prod \limits_{k = 1}^{\prime N} \left[ {1 - P_{\rm deg }^{\left( n \right)} \left( {\frac{{\Updelta \omega_{jk}^{(1)} }}{{\delta^{\left( 1 \right)} }}, \ldots ,\frac{{\Updelta \omega_{jk}^{(n)} }}{{\delta^{\left( n \right)} }};\frac{{\sigma_{jk}^{(1)} }}{{\delta^{\left( 1 \right)} }}, \ldots , \frac{{\sigma_{jk}^{(n)} }}{{\delta^{\left( n \right)} }}} \right)} \right]. $$
(9)

The product in Eq. 9 runs over all peaks other than peak j; \( \Updelta \omega_{jk}^{(i)} = \omega_{j}^{(i)} - \omega_{k}^{(i)} \) and \( \sigma_{jk}^{(i)} = \sqrt {\left( {\sigma_{j}^{(i)} } \right)^{2} + \left( {\sigma_{k}^{(i)} } \right)^{2} } . \) The expected total number of overlapping peaks thus becomes

$$ N_{\rm deg } = \mathop \sum \limits_{j = 1}^{N} P_{\rm deg }^{{\left( {\text{tot}} \right)}} (j). $$
(10)

In the special case of equal overlap probabilities \( P_{\rm deg }^{\left( n \right)} \) for all peak pairs, Eqs. 9 and 10 reduce to Eq. 1 with \( P_{\rm deg }^{\left( n \right)} = \gamma /\Upgamma \): the expected number of non-overlapped peaks is then \( N - N_{\rm deg } = N\left( {1 - \gamma /\Upgamma } \right)^{N - 1} = \tilde{N} . \) Thus, the present theory is a generalization of the simple earlier approaches (Kainosho et al. 2006; Mumenthaler et al. 1997).
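Equations 9 and 10 can be sketched in a few lines once the pairwise probabilities are available. In the following illustration the pairwise overlap probabilities are passed as a square matrix, which is an assumed data layout. The function names are likewise illustrative. The example reproduces the special case of equal pairwise probabilities.

```python
def total_overlap_probabilities(pair_prob):
    """Eq. 9: probability that each peak overlaps with at least one other peak.

    pair_prob[j][k] is the n-dimensional overlap probability of peaks j and k;
    the diagonal is ignored.
    """
    n_peaks = len(pair_prob)
    totals = []
    for j in range(n_peaks):
        not_overlapped = 1.0
        for k in range(n_peaks):
            if k != j:
                not_overlapped *= 1.0 - pair_prob[j][k]
        totals.append(1.0 - not_overlapped)
    return totals


def expected_overlapped(pair_prob):
    """Eq. 10: expected total number of overlapped peaks."""
    return sum(total_overlap_probabilities(pair_prob))


# Special case: equal pairwise probability p = gamma/Gamma for all pairs.
n_peaks, p = 50, 0.01
uniform = [[p] * n_peaks for _ in range(n_peaks)]
n_resolved = n_peaks - expected_overlapped(uniform)
print(n_resolved, n_peaks * (1.0 - p) ** (n_peaks - 1))
```

Both printed numbers should coincide, since the expected number of non-overlapped peaks equals the left-hand expression of Eq. 1 in this case.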

Verification by Monte-Carlo simulation

Monte-Carlo simulation was used to verify the correctness of the theory of Eqs. 3–10 by simulating for a given sequence a large number of peak lists according to the model of Eq. 3. Shift positions of atoms were sampled from normal distributions with mean values and standard deviations corresponding to the shift statistics, and peak lists were generated according to the magnetization transfer pathways of the NMR experiment. A peak pair was considered to be overlapped with the probability of Eq. 4. The procedure was repeated 50,000 times and the average overlap probability was compared with the analytical result of Eq. 10.

Normally distributed random numbers were generated by the transformation method (Press et al. 1986): Two random numbers x 1, x 2, distributed uniformly in the interval [−1, 1], are generated. If \( r^{2} = x_{1}^{2} + x_{2}^{2} > 1, \) they are rejected, and a new pair of random numbers is generated. Otherwise, two normally distributed random numbers u 1, u 2 are obtained as \( u_{1,2} = x_{1,2} \sqrt { - 2\log (r^{2} )/r^{2} } . \)
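The sampling scheme and the Monte-Carlo check can be illustrated for a single peak pair in one dimension. The sketch below is not the implementation used for the calculations in this paper: it uses Python's standard random module as the source of uniform random numbers, additionally rejects the degenerate case r² = 0 to avoid a division by zero, and all function names and numerical values are assumptions.

```python
import math
import random


def polar_normal_pair(rng):
    """Two independent standard normal deviates by the transformation method."""
    while True:
        x1 = rng.uniform(-1.0, 1.0)
        x2 = rng.uniform(-1.0, 1.0)
        r2 = x1 * x1 + x2 * x2
        if 0.0 < r2 <= 1.0:  # reject points outside the unit circle
            factor = math.sqrt(-2.0 * math.log(r2) / r2)
            return x1 * factor, x2 * factor


def monte_carlo_p_deg(omega1, sigma1, omega2, sigma2, delta, trials, rng):
    """Fraction of trials in which two sampled peaks count as overlapped.

    Shifts are drawn from the normal distributions of Eq. 3; the pair is
    then declared overlapped with the probability of Eq. 4.
    """
    hits = 0
    for _ in range(trials):
        u1, u2 = polar_normal_pair(rng)
        dx = (omega1 + sigma1 * u1) - (omega2 + sigma2 * u2)
        if rng.random() < math.exp(-0.5 * (dx / delta) ** 2):
            hits += 1
    return hits / trials


rng = random.Random(2013)  # fixed seed for reproducibility
estimate = monte_carlo_p_deg(8.2, 0.5, 8.4, 0.6, 0.3, 50_000, rng)
variance = 0.5 ** 2 + 0.6 ** 2 + 0.3 ** 2
analytic = 0.3 / math.sqrt(variance) * math.exp(-0.5 * (8.2 - 8.4) ** 2 / variance)
print(estimate, analytic)
```

The simulated fraction should agree with the analytic value of Eq. 5 within the statistical error of the sampling.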

Test data sets

The algorithm was evaluated for eight different proteins to which we refer in this paper by four-letter acronyms (Table 1): CPRP, the chicken prion protein fragment 128–242 (Calzolai et al. 2005); ENTH, the ENTH-VHS domain At3g16270 from Arabidopsis thaliana (López-Méndez and Güntert 2006; López-Méndez et al. 2004); FSH2, the Src homology 2 domain from the human feline sarcoma oncogene Fes (Scott et al. 2004, 2005); FSPO, the F-spondin TSR domain 4 (Pääkkönen et al. 2006); PBPA, the Bombyx mori pheromone binding protein (Horst et al. 2001); RHOD, the rhodanese homology domain At4g01050 from Arabidopsis thaliana (Pantoja-Uceda et al. 2004, 2005); SCAM, stereo-array isotope labeled (SAIL) calmodulin (Kainosho et al. 2006); DSRP, the delta subunit of RNA polymerase from Bacillus subtilis (Motáčková et al. 2010). The proteins CPRP, ENTH, PBPA, and SCAM are predominantly α-helical; FSH2, and RHOD have mixed α/β secondary structure. The protein FSPO has an unusual fold with little regular secondary structure (Pääkkönen et al. 2006). The protein SCAM has two domains connected by a flexible linker; DSRP is an intrinsically disordered protein that contains a disordered C-terminal region of 81 amino acids with a highly repetitive sequence; all others have a well-defined single-domain structure.

Table 1 Overview of protein data sets used for overlap prediction

In addition, overlap prediction was also carried out for the [1H,15N]-HSQC spectra of 2,174 proteins for which chemical shift assignments are available from the BMRB that are sufficiently complete to assign more than 70 % of the expected peaks.

Results and discussion

Our goal was to provide a flexible and user-friendly algorithm that is capable of predicting spectral overlap in NMR spectra and that can estimate the usefulness of labeling schemes, given a specific sequence, prior to producing samples and measuring NMR spectra. Overlap prediction for a spectrum with several hundred peaks takes about 2 s on a standard desktop computer with 2.4 GHz Intel processor. The maximal runtime of 28 s was measured for a TOCSY spectrum with several thousand peaks for the largest protein in the BMRB.

Measured and predicted overlap in a [1H,15N]-HSQC spectrum

As a first test application the predicted overlap was compared to the overlap observed in the experimental [1H,15N]-HSQC of the protein RHOD, for which the chemical shift assignments and the experimental peak list are available (Fig. 3a). Expected peaks were generated using the magnetization transfer rules in the CYANA library (Schmidt and Güntert 2012; Schmucki et al. 2009), and the overlap probability was calculated for the peaks at the positions given by the experimental chemical shift (“measured overlap”, Fig. 3b) and by Eq. 9 without knowledge of the peak positions (“predicted overlap”, Fig. 3c). In both cases most overlap occurs in the same regions of the spectrum. As expected, the predicted overlap is distributed over many peaks in the crowded regions, whereas the measured overlap affects specific peaks. In principle, the experimental spectrum is an instance taken from the general distribution over which overlap prediction by Eq. 9 is averaging. Overlap prediction is able to distinguish peaks in crowded regions from those in better resolved regions and could thus be used to optimize a labeling pattern that reduces the peak overlap without undue loss of signals.

Fig. 3
Overlap in the [1H,15N]-HSQC spectrum of RHOD. a Experimental spectrum (Pantoja-Uceda et al. 2004). b Spectrum simulated using the experimental chemical shifts. Signals are colored from white to black with increasing overlap calculated for the fixed peak positions using Eq. 4. c Same spectrum as in b, colored according to the overlap probability predicted by Eq. 9 using only the sequence and spectrum type information

The effect of additional dimensions on the overlap

Higher-dimensional NMR spectroscopy reduces overlap significantly. To show that the algorithm correctly predicts this behavior we compared overlap predictions for the protein RHOD using two pairs of corresponding two- and three-dimensional spectra, i.e. 2D NOESY versus 3D 13C-resolved NOESY, and [1H,15N]-HSQC versus HNCA (Fig. 4). The overlap predicted using Eq. 9 is strongly reduced by the presence of the extra dimension, especially in case of 2D NOESY (Fig. 4a) versus 3D NOESY (Fig. 4b). [1H,15N]-HSQC (Fig. 4c) and HNCA (Fig. 4d) show less overlap overall, but again the introduction of the third dimension in the HNCA removes most of the signal overlap present in the [1H,15N]-HSQC spectrum.

Fig. 4
Overlap comparison for spectra with different numbers of dimensions for the protein RHOD. a 2D homonuclear NOESY spectrum. b 3D 13C-resolved NOESY spectrum. c 2D [1H,15N]-HSQC spectrum. d 3D HNCA spectrum, projected onto the [1H,15N]-plane. Signals are colored in red from white to black with increasing overlap predicted by Eq. 9 using only the sequence and general chemical shift statistics. The peak positions correspond to the known chemical shift assignments for RHOD

Overlap prediction for spectra of a test set of eight proteins

To demonstrate the overlap prediction for a variety of different types of spectra, eight proteins were analyzed for which chemical shift assignments are available (Table 1). The amount of overlap was predicted by Eq. 9 based only on the sequence and the general chemical shift statistics of the BMRB (red dots in Fig. 5) and compared to the overlap measured on the basis of the known chemical shift assignments using the chemical shift list of the given protein from the BMRB (green crosses in Fig. 5). For comparison, the percentage of overlap and its standard deviation were also predicted using the Monte Carlo method (blue dots and error bars in Fig. 5).

Fig. 5
Overlap prediction and measurement for eight proteins, one of which (DSRP) is intrinsically unstructured. The percentage of overlapped peaks predicted from the sequence alone using Eqs. 9–10 is shown as red dots. The average value and the standard deviation of the predicted overlap obtained by Monte Carlo simulation are shown in blue. The measured overlap percentage obtained by applying Eq. 4 to the expected peaks at the positions given by the known experimental chemical shift assignments is indicated by green crosses. Where experimental assignments for a certain class of nuclei, e.g. carbonyls, were not available, only the predicted overlap for all atoms is reported

The overlap measured on the basis of the known chemical shift assignments (green crosses in Fig. 5) and the overlap predicted from the sequence (red dots in Fig. 5) are highly correlated for the spectra of a given protein with Pearson correlation coefficients of 0.84–0.98 (significance <0.00011 in all cases). As expected, overlap increases with protein size. Among the spectra analyzed for any given protein, the overlap is in general largest for the homonuclear 2D spectra, and smallest for triple resonance backbone assignment spectra. Generally, the overlap probability increases with the number of peaks in a spectrum, although this is not universal. The overlap prediction depicts faithfully differences in the measured overlap between different spectra. For longer proteins the predictions appear to become more accurate, which may be an effect of the law of large numbers and the central limit theorem from which it follows that the more peaks are analyzed, the better the assumptions of the theory are fulfilled. As expected, the intrinsically disordered protein DSRP is an exception in that the measured overlap exceeds significantly the one predicted on the basis of the general chemical shift statistics, which is derived almost exclusively from folded globular proteins. It will thus be necessary to derive separate chemical shift statistics for intrinsically disordered proteins in order to obtain more realistic results for this class of proteins. At present, the scarcity of chemical shift assignments available for intrinsically disordered proteins does not yet provide reliable statistics.

The correctness of the overlap prediction by the analytic formulas of Eqs. 8–10 (red in Fig. 5) was verified by Monte Carlo simulation (blue in Fig. 5). The average overlap values obtained by Monte Carlo simulation are always in close agreement with the analytical result. The standard deviation is often considerable, indicating that the amount of overlap observed for a single given protein can deviate significantly from the analytical average result even if the chemical shift values of the atoms follow the assumed normal distributions.

Analyzing proteins in the BMRB

As a large-scale application, we calculated for all 2,174 proteins with sufficiently complete chemical shift entries in the BMRB the extent of overlap in [1H,15N]-HSQC spectra by prediction based on the sequence alone and, for comparison, by measurement based on the chemical shift assignments from the BMRB (Fig. 6). This provided a means to investigate the predictive power of our method for a large variety of proteins and to justify the use of the soft overlap criterion of Eq. 4. Figure 6a shows that for all proteins the use of a hard cutoff or the “soft” criterion of Eq. 4 yielded very similar results. Figure 6b shows a comparison of the predicted and measured numbers of overlapped peaks. Overall, they are correlated with a correlation coefficient of 0.79 (significance \( < 10^{ - 10} \)). The spread for individual proteins is comparable to the standard deviation seen in the Monte Carlo simulation results depicted in Fig. 5. In addition, there are some proteins for which the measured amount of overlap exceeds the predicted overlap considerably. Manual inspection of individual cases showed that these correspond either to intrinsically unfolded proteins, similar to the example of DSRP in Fig. 5, or to symmetric multimers. In principle, more realistic prediction results could be obtained for the former by using separate chemical shift statistics restricted to intrinsically unfolded proteins, and for the latter by explicitly taking into account the symmetry in a modified theory.

Fig. 6
Number of overlapped peaks in [1H,15N]-HSQC spectra for 2,174 proteins with chemical shift assignments from the BMRB that are sufficiently complete such that more than 70 % of the [1H,15N]-HSQC peaks are assigned. a Overlap measured by a hard cutoff \( \delta^{\prime } = 1.25 \delta \) (see “Materials and methods”), where δ is the overlap tolerance, for the expected peaks at the positions given by the experimental chemical shift assignments plotted versus the overlap measured for the same peaks using the “soft” Gaussian overlap probability function of Eq. 4. b Overlap measured using the soft criterion for the same peaks as in panel a plotted versus the overlap predicted from the sequence alone using Eqs. 9–10

Conclusions

In this paper we have introduced a new general method for estimating the overlap of peaks in NMR spectra that can be applied as soon as the sequence of the protein is known, e.g. before starting sample preparation and NMR measurements. Results for the average overlap are in agreement with the amount of overlap measured in experimental spectra, although the method obviously cannot predict with certainty whether an individual peak will be overlapped or not. The overlap estimation can be used to distinguish proteins with potentially heavily overlapped spectra from those with better chemical shift dispersion based on primary structure information alone.

Overlap estimation can be used, for instance, to support setting proper signal sampling parameters for NMR experiments, e.g. the number of dimensions, maximum evolution times, use of linear prediction, non-uniform sampling and other resolution-improving techniques. Overlap prediction can support the design of overlap-optimized labeling schemes. For a given sequence and a given number of amino acid types that are to be labeled, these can be chosen so that the predicted amount of overlap is minimal, while preserving the maximal information possible. Consequently, there will not be a unique optimal solution, but rather a set of efficient solutions that are characterized by the fact that their overlap cannot be improved further without losing information. It can also be envisaged to use overlap prediction in automated assignment algorithms, e.g. to define a priori probabilities for the observation of peaks, for locally steering peak picking algorithms, and for weighing peak assignments and penalties for peak degeneracy in scoring functions for assignments (Schmidt and Güntert 2012). Similar applications are conceivable for the identification of conformational restraints for structure calculations. Conformational restraints derived from peaks in less overlapped regions are potentially safer to introduce into the structure calculation. A priori overlap prediction based on the present theory can therefore play a role in improving the reliability of automated spectra analysis and protein structure determination.
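The notion of a set of efficient labeling schemes can be made concrete with a small sketch that filters candidate schemes for those that are not dominated by any other candidate. This is only an illustration of the selection principle. The scheme names and numbers are invented, and the use of the number of observable peaks as the measure of preserved information is an assumption.

```python
def efficient_schemes(schemes):
    """Return the names of labeling schemes that are not dominated.

    schemes: list of (name, predicted_overlap, n_observable_peaks).
    A scheme is dominated if another one has no more overlap and no fewer
    observable peaks, and is strictly better in at least one of the two.
    """
    efficient = []
    for name, overlap, peaks in schemes:
        dominated = any(
            other_overlap <= overlap
            and other_peaks >= peaks
            and (other_overlap < overlap or other_peaks > peaks)
            for _, other_overlap, other_peaks in schemes
        )
        if not dominated:
            efficient.append(name)
    return efficient


# Invented candidates: (name, predicted overlapped peaks, observable peaks).
candidates = [("A", 10.0, 100), ("B", 5.0, 80), ("C", 12.0, 90), ("D", 5.0, 70)]
print(efficient_schemes(candidates))
```

In this invented example, schemes C and D are discarded because A and B, respectively, are at least as good in both criteria.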