1 Introduction

The interpretation of electrospray-ionization mass spectrometry (ESI-MS) data benefits greatly from understanding the factors that control the degree of protein ionization and thus give rise to the observed charge-state distributions (CSDs) [1, 2]. The current view of protein electrospray invokes different mechanisms to explain the ionization behavior of unfolded proteins (either proteins under denaturing conditions or intrinsically disordered proteins) and of folded globular proteins under non-denaturing conditions. The degree of ionization of unfolded proteins is considered to be controlled by the apparent gas-phase basicity of protein ions (GBapp) relative to that of the solvent [3–5]. GBapp measures the propensity of the ionizable groups of desolvated protein ions to acquire a proton. It approaches the GB of the solvent for unfolded proteins in their highest observable charge state [4].

For folded proteins, the interpretation is less straightforward. The extent of protein ionization has been interpreted in terms of GB also in this case [6–9]. In particular, the GBapp of folded cytochrome c has been calculated from the crystallographic structure, accounting for Coulomb repulsions [9]. The derived value for the 11+ ion was slightly below the value for water, suggesting that the GB model could be extended to folded proteins. However, the most accredited hypothesis considers the charge of the precursor ESI droplet as the key factor determining the extent of protein ionization [10, 11]. In turn, the droplet charge is assumed to be close to the Rayleigh limit. In support of this hypothesis, plots of average protein charge as a function of protein mass can be fitted well by the Rayleigh equation using the surface tension of water and a droplet radius equal to that of the globular protein structure [10]. This hypothesis would imply that the charge of the protein depends on solvent surface tension following the Rayleigh equation [12]. Such a dependence could not be reproduced by experiments on either folded or unfolded proteins [1, 2, 12–15], although solvent surface tension certainly plays an important role in the ESI process [1, 2], and has also been suggested to explain some effects of supercharging agents [11, 16].

Here, we use computational methods to investigate the relevance of GBapp for folded proteins under electrospray conditions, with no assumption about a role of the charge of the precursor ESI droplet. Our test systems are nine structurally diverse and well-characterized proteins, spanning a range of molecular weight from 4.0 to 29.2 kDa and a wide range of isoelectric points (i.e., from basic to acidic). We first calculate GBapp by introducing a Monte-Carlo/molecular-dynamics (MC/MD) scheme, which takes into account the ionization of basic (Arg, Lys, His, and N-terminus) and acidic (Asp, Glu, and C-terminus) groups. This procedure explicitly considers the influence of protein structure on the intrinsic gas-phase basicity of each ionizable group, allowing for a combined exploration of the conformer and protomer space. This method leads to the identification of a set of lowest-energy protomers for each value of protein net charge. In most cases, self-solvation networks lead to maintenance of zwitterionic states. Next, we propose a simple mathematical model based on the results of these atomistic investigations. This model has no adjustable parameters and it reproduces the well-known correlations between the protein net charge, q, and both mass [10] and solvent-exposed surface [17]. Our model hints at GBapp as a key factor for the ionization of folded proteins, suggesting that protein ionization depends on intrinsic properties of the protein structure and on the GB of the solvent [4].

2 Computational Details

2.1 Proteins

Nine globular proteins of diverse size and fold were considered (Table 1): ragweed pollen allergen from Ambrosia trifida (40 residues, pdb entry 1BBG), bovine pancreatic trypsin inhibitor (56 residues, pdb entry 1UUA), C-terminal domain of the ribosomal protein L7/L12 from E. coli (68 residues, pdb entry 1CTF), α-amylase inhibitor tendamistat (74 residues, pdb entry 1OK0), human ubiquitin (76 residues, pdb entry: 1V80), Ribonuclease SA (96 residues, pdb entry 1C54), E60Q mutant of human FK506 binding protein-12 (107 residues, pdb entry 2PPP), hen-egg white lysozyme (129 residues, pdb entry 1LZT), human carbonic anhydrase II (260 residues, pdb entry 2CBA). The 1LZT, 1BBG, 1UUA, and 1OK0 proteins feature 4, 4, 3, and 2 disulphide bridges, respectively. Breaking of these bonds was not considered.

Table 1 Proteins Studied in this Work: Ragweed Pollen Allergen (1BBG), Bovine Pancreatic Trypsin Inhibitor (1UUA), C-Terminal Domain of the Ribosomal Protein L7/L12 from E. coli (1CTF), α-Amylase Inhibitor Tendamistat (1OK0), Human Ubiquitin (1V80), Ribonuclease SA (1C54), E60Q Mutant of Human FK506 Binding Protein-12 (2PPP), Hen-Egg White Lysozyme (1LZT), and Human Carbonic Anhydrase II (2CBA). For Each Protein, the Table Lists the Number of Residues, the Mass (kDa), the Fold, the Number of Basic (Arg, Lys, His, and N-terminus) and Acidic Residues (Asp, Glu, and C-terminus) Considered in the Present Work, and the Main Charge State Observed Experimentally for Ions Originated from Water, along with the Charge Predicted in the Present Study

Available evidence suggests that, under mild ESI conditions, protons are mostly exchanged among a few sites (i.e., mainly the Arg, Lys, His, Glu, and Asp side chains and the N- and C-termini) [18]. Thus, in order to keep the sampling problem tractable, only protonation and deprotonation of these groups were considered.

2.2 Protomer Space Exploration

Predicting the distribution of protonated sites within a protein is not trivial, since the number of possible protomers can be too large to be explored exhaustively by any theoretical approach, even for a small protein. To cope with this problem, several Monte-Carlo (MC) protocols have been proposed [19] (and references therein). These schemes assume that the protein structure does not change with the protonation state, and they usually take the (average) protein structure in aqueous solution or the crystallographic structure as a good approximation of the gas-phase structure. This may be generally true for the protein backbone. However, it might not necessarily hold for side chains. To tackle this issue, the present study employs a combined MC/MD sampling scheme using the OPLS/AA force field with GB corrections. These corrections are needed because standard force fields for biomolecular simulations are unable to describe bond breaking and forming, which poses a serious limitation to the exploration of the protomer space using molecular-mechanics schemes. Here we propose to augment the standard force field energies, E_FF, with additional energy terms associated with the GB of ionizable residues, introducing the following corrected potential function:

$$ {{E}_{\text{corr}}} = - \sum\limits_i^{\prime } \mathop{{\left( {\text{GB}} \right)}}\nolimits_i {{\delta }_i} + {{E}_{\text{FF}}} $$

where the summation runs over all of the ionizable residues and δ_i is 1 if the i-th residue is ionized and 0 otherwise.
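As a minimal illustration of the corrected potential above, the following sketch evaluates E_corr for a given ionization pattern. The function name `corrected_energy`, the residue labels, and all numerical values are hypothetical placeholders introduced for this example; they are not the reference GBs used in this work.

```python
def corrected_energy(e_ff, gb, ionized):
    """Return E_corr = E_FF - sum_i GB_i * delta_i (all energies in kJ/mol).

    e_ff    : force-field energy of the protomer
    gb      : dict mapping each ionizable residue to its reference GB
    ionized : set of residues that are ionized (delta_i = 1)
    """
    return e_ff - sum(gb[res] for res in ionized)


# Hypothetical placeholder values, for illustration only.
gb = {"Lys5": 900.0, "Arg12": 1000.0, "Asp20": 1400.0}
e_corr = corrected_energy(-5000.0, gb, {"Lys5", "Arg12"})
print(e_corr)
```

Residues that are not ionized contribute nothing to the sum, so a fully neutral protomer keeps its plain force-field energy.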

We chose the OPLS/AA force field [20], because it offers the most complete set of base/conjugate acid pairs. The adopted correction was validated against density functional theory (DFT) calculations on the small ragweed pollen allergen protein. DFT calculations were performed using the Becke exchange [21] and Lee-Yang-Parr [22] correlation functionals (BLYP) within a hybrid Gaussian and plane-wave approach [23], along with norm-conserving pseudopotentials [24] to describe the core electrons. The TZV2P Gaussian basis set was used for valence electrons of all atoms, while the auxiliary electron density was expanded in plane waves up to a cutoff of 280 Ry. Interaction between periodic images in the reciprocal space was removed according to the decoupling scheme presented in Ref. [25]. Dispersion energy was included according to Ref. [26]. We will refer to the dispersion-corrected DFT energy as DFT + D. As previously described [27], the adopted DFT scheme has been validated against more accurate quantum-chemical calculations (DFT/B3LYP and MP2). The wave function has been optimized according to Ref. [23]. The calculations were carried out with the CP2K code [23].

The comparison between DFT + D and corrected force field calculations was performed on 35 selected protomers (out of a total of about 460 possible protomers) of the ragweed pollen allergen protein at q = 1+. For each protomer, conformational sampling was carried out according to the previously described protocol [27]. Several sets of GB values taken from the literature [9, 28, 29] were tested. The best agreement between DFT + D and corrected force field was obtained for the GBs calculated previously [27] for amino acids in an extended conformation, where the ionized groups do not form any short-range interactions such as hydrogen bonds or salt bridges. Indeed, this type of interaction is reasonably well described by the force field and there is no need to include it in the calculation of the amino-acid reference GB.

DFT energies do not correlate with non-corrected force-field energies (Figure SI-1A). The addition of the correction introduces a marked linear correlation between the two different energy evaluations (R² = 0.93, Figure SI-1B). The data dispersion indicates that the corrected force field allows one to discriminate between high- and low-energy protomers, but not to resolve small energy differences and, thus, not to identify the single lowest-energy protomer. The standard error of the estimate using the linear correlation of Figure SI-1B is σ = 35 kJ/mol. If we assume a normal distribution of the DFT + D energies around the estimate obtained from the corrected force field, there is a confidence of 68.3 % and 99.7 % that the DFT + D energy is within σ and 3σ (about 100 kJ/mol) from the estimate, respectively. Indeed, all of the conformers located within 10 kJ/mol from the DFT + D minimum fall within 100 kJ/mol above the OPLS/AA minimum. Similar discrepancies are found using the Amber [30] and GROMOS [31] force fields (data not shown). Hence, we carry out a statistical analysis of the protomer properties within a given energy cut-off, which gives high confidence that the minimum-energy protomer is included. We will refer to these protomers as the most probable protomers. The discussion presented in this work is based on a 100 kJ/mol cut-off. Only a few structures are within this cut-off (about 10 out of thousands). Different choices of the cut-off energy (from σ to 3σ) give comparable results (data not shown).
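The confidence levels quoted above follow from the normal-distribution assumption and can be checked with a short sketch. The helper names `normal_coverage` and `most_probable` are hypothetical, and the energy list in the usage lines is made up purely to show how an energy cut-off selects protomers.

```python
import math


def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


def most_probable(energies, cutoff=100.0):
    """Indices of protomers whose energy is within `cutoff` (kJ/mol) of the minimum."""
    e_min = min(energies)
    return [i for i, e in enumerate(energies) if e - e_min <= cutoff]


sigma = 35.0  # kJ/mol, standard error quoted in the text
print(f"within 1 sigma ({sigma:.0f} kJ/mol): {normal_coverage(1):.3f}")
print(f"within 3 sigma ({3 * sigma:.0f} kJ/mol): {normal_coverage(3):.3f}")

# Made-up relative energies (kJ/mol), for illustration only.
selected = most_probable([0.0, 50.0, 150.0, 100.0])
```

With σ = 35 kJ/mol, the 3σ window is 105 kJ/mol, which the text rounds to the 100 kJ/mol cut-off.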

Using the OPLS/AA force field with GB corrections, protonation sites were randomly permuted and the total energy was calculated. At each MC step, a proton exchange is accepted or rejected according to the Metropolis criterion. The structure of each considered protomer was relaxed with the following simulated annealing-like procedure. First, a 400-ps, high-temperature (400 K) MD simulation was performed. This temperature was selected after several careful tests and it allows for an exhaustive sampling of side-chain conformations without disrupting, in the relatively short simulation time, the protein secondary structure. The resulting trajectory was split into 60 equally spaced time windows. In each of these windows, the geometry of the lowest-energy conformation was optimized. The optimized structure was then employed in the MC procedure. For each value of net charge, about 10³ lowest-energy configurations were sampled. This procedure converges in a relatively small number of MC steps. Indeed, MC searches starting from different protomers yielded substantially the same final charge configurations, differing at most in the position of one or two protons.
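The Metropolis acceptance step can be sketched as follows. This is a simplified illustration, not the protocol used in this work: it moves a single proton per step, ignores zwitterion formation, and replaces the MD relaxation and corrected force-field energy with a user-supplied `energy` callable. The MC temperature is left as a parameter because it is not specified above; all function names are hypothetical.

```python
import math
import random

R_KJ = 0.0083144626  # gas constant in kJ/(mol K)


def metropolis_accept(delta_e, temperature, rng=random):
    """Accept downhill moves always, uphill moves with probability exp(-dE/RT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / (R_KJ * temperature))


def mc_proton_search(sites, protonated, energy, n_steps, temperature, seed=0):
    """Random single-proton exchange search over protonation patterns.

    sites      : list of ionizable site labels
    protonated : initial set of sites carrying a proton
    energy     : callable mapping a frozenset of protonated sites to kJ/mol
                 (a stand-in for the relaxed, GB-corrected energy)
    """
    rng = random.Random(seed)
    current = frozenset(protonated)
    e_cur = energy(current)
    best, e_best = current, e_cur
    for _ in range(n_steps):
        donors = sorted(current)
        acceptors = sorted(set(sites) - current)
        if not donors or not acceptors:
            break
        # Move one proton from a protonated site to an empty one.
        trial = (current - {rng.choice(donors)}) | {rng.choice(acceptors)}
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e_cur, temperature, rng):
            current, e_cur = trial, e_trial
            if e_cur < e_best:
                best, e_best = current, e_cur
    return best, e_best
```

Because each move transfers one proton between sites, the number of protons, and hence the net charge in this simplified picture, is conserved along the search.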

The initial structure for the MC procedure was extracted from a 2-ns MD simulation at ambient conditions of the protein in aqueous solution (with counter-ions added). These preparatory simulations were long enough to stabilize the protein dynamics, as deduced from the root mean square displacement (RMSD) of the backbone heavy-atoms. In all cases, the structure closest to the average one was taken. The initial charge configuration for the MC process, instead, was randomly generated allowing for positively charged basic residues and negatively charged acidic residues.

The time evolution of any lowest-energy protomer obtained from the MC/MD protocol was followed at 300 K for 20 ns (Table 2) and, in the case of the q = 8+ ubiquitin ion, for 1 μs (Figure SI-9). All the calculations were carried out using the GROMACS [32] MD package. A time step of 1.5 fs for the integration of equations of motion was used in all of the simulations.

2.3 Calculation of the Apparent Gas-Phase Basicity

The GBapp is an extension of the concept of GB [4], and quantifies the ability of a protein, in a given conformation and charge state, to increase its charge state through the addition of a proton. The GBapp of a protein corresponds to the GB of the residue (embedded in the protein environment) with the highest gas-phase basicity. The GBapp,i of the i-th residue in a protein with total charge q is defined as [4]

$$ \mathop{\text{GB}}\nolimits_{{{\text{app}},i}} = \mathop{\text{GB}}\nolimits_i - \left( {E_{\text{FF}}^{{\left( {i,q} \right)}} - E_{\text{FF}}^{{\left( {i,q - 1} \right)}}} \right), $$

where GB_i is the GB of the i-th amino acid in the gas phase and \( E_{\text{FF}}^{{\left( {i,q} \right)}} \) (or \( E_{\text{FF}}^{{\left( {i,q - 1} \right)}} \)) is the energy of the protein with that residue protonated (or non-protonated). In contrast to the original formulation, developed for a coarse-grained representation of an unfolded protein [4], we include in the GBapp calculation all of the classical energy terms considered by a force field. No vibrational correction was taken into account. The justification for this choice has been discussed previously in the literature [27, 33].
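The definition above translates directly into code. The sketch below is a minimal illustration with hypothetical function names and made-up energies; it takes the protein GBapp as the largest GBapp,i over the candidate sites, as stated in the text.

```python
def gb_app_residue(gb_i, e_ff_protonated, e_ff_deprotonated):
    """GBapp,i = GB_i - (E_FF^(i,q) - E_FF^(i,q-1)), all in kJ/mol."""
    return gb_i - (e_ff_protonated - e_ff_deprotonated)


def gb_app_protein(candidates):
    """GBapp of the protein: the largest GBapp,i over the candidate sites.

    candidates : iterable of (gb_i, e_ff_protonated, e_ff_deprotonated) tuples
    """
    return max(gb_app_residue(*c) for c in candidates)


# Made-up values (kJ/mol), for illustration only.
sites = [(900.0, -4950.0, -5000.0), (1000.0, -4800.0, -5000.0)]
print(gb_app_protein(sites))
```

In this made-up example the site with the higher intrinsic GB ends up with the lower GBapp,i, because protonating it costs more force-field energy.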

We stress that, in the present study, we do not report the GBapp of the lowest-energy protomer of a protein in a given charge state, which might be ill-defined as discussed in the previous section. Rather, we extrapolate trends of the average GBapp for a large set of proteins and charge states, and try to relate these trends to the experimentally observed CSDs and the GB of the solvent from which the ions originate.

3 Results and Discussion

3.1 Protein Structure in Vacuo

We have analyzed the conformational and protomer space of nine proteins featuring different size, fold, and pI (Table 1). Significant conformational rearrangements take place upon desolvation. However, the most probable protomers identified by the MC/MD procedure conserve their secondary and tertiary structure upon passing from the aqueous solution to the gas phase at room temperature (Table 2 and Figure SI-2). The gyration radius (R_g), calculated over the nanosecond time-scale, decreases in a similar way for all of the charge states considered here. This contraction involves the solvent-exposed side chains, which fold onto the protein surface and, to a lesser extent, also the backbone. These rearrangements lead to the formation of new intramolecular hydrogen bonds (HBpp). The total surface area (A_tot) also decreases, whereas the hydrophobic portion of the total surface area (A_phob) increases, as already reported [34–36]. The proteins turn out to be more rigid in the gas phase, as indicated by the RMSD of backbone atoms around their average positions. The most dramatic change is observed for the small pollen allergen, which has relatively flexible parts pointing to the solvent in the native structure. Moreover, the protein structural rearrangements on passing from solution to gas phase lead to the formation of salt bridges, as also reported in Ref. [34]. Further structural features (Table 2) are in close agreement with those found in previous simulation studies [34–36] and, hence, their detailed description is omitted.

3.2 Gas-phase Basicity and Protein Ionization

The number of ionized residues, n_IR, and the number of hydrogen bonds formed by ionized residues, n_iHB, per unit of protein surface are constant among our protein data set (Figure SI-3), with about 0.16 ionized residues and 0.43 hydrogen bonds per nm², for the most probable protomers of the proteins considered here, in their predicted most populated charge state: n_IR = 0.148S + 1.459 (R² = 0.976) and n_iHB = 0.434S − 1.438 (R² = 0.989), with S in nm². Both positive and negative ionized residues tend to form the maximum number of hydrogen bonds compatible with the geometry of the gas-phase structures (Table 3). In general, protonated amino groups donate three hydrogen bonds (one for each N–H bond) and carboxylates receive four hydrogen bonds. The average number of hydrogen bonds per residue type is roughly constant across the proteins investigated. The differences are within the standard deviation (Table 3).
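The two linear relations quoted above can be evaluated directly. In the sketch below the function names are hypothetical and the surface area of 60 nm² is an arbitrary illustrative input, not a value taken from Table 2.

```python
def n_ionized_residues(surface_nm2):
    """Fitted relation n_IR = 0.148 * S + 1.459, with S in nm^2."""
    return 0.148 * surface_nm2 + 1.459


def n_ionized_hbonds(surface_nm2):
    """Fitted relation n_iHB = 0.434 * S - 1.438, with S in nm^2."""
    return 0.434 * surface_nm2 - 1.438


surface = 60.0  # nm^2, arbitrary illustrative value
print(n_ionized_residues(surface), n_ionized_hbonds(surface))
```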

Table 3 Average Number of Hydrogen Bonds Formed by each Type of Ionized Residue in the Most Probable Protomers (Protomers Under 100 kJ/mol from the Energy Minimum) for the Proteins of Table 1. The Total Number of Ionized Hydrogen Bonds is Reported in the Last Column. The Standard Deviation from the Average is Given in Brackets

Zwitterionic states are mostly retained, especially for low charge states, as can be seen by plotting the number of ionized residues as a function of the protein total charge (Figure 1 and Figure SI-6). Indeed, intramolecular hydrogen bonds, salt-bridges, π-charge interactions, and long-range electrostatic interactions can compensate for the thermodynamic penalty of charge separation in vacuo, providing internal solvation [27]. This finding is consistent with recent experimental evidence [6, 37, 38]. It also supports the hypothesis that a higher propensity for zwitterionic states of folded versus unfolded proteins can lower the net charge of the former, contributing to conformational effects in ESI-MS [39]. In contrast, there is no clear dependence of the number of salt bridges on the protein size (Figure SI-4). This result suggests that the formation of salt bridges depends on peculiar features of the protein structure.

Figure 1

Average number of ionized residues (circles) in the most probable protomers of the hen-egg white lysozyme as a function of the protein net charge (q). The data for the other proteins considered in the present study are reported in Fig. SI-6. Standard deviation from the average is given as error bars. The interval of ionized residues spanned by the most probable protomers for each charge is reported as a cyan bar. The minimum and the maximum number of possible ionized residues for each total charge are indicated by the green and the red line, respectively. The vertical dashed blue line indicates the most probable charge state starting from aqueous solutions

The GBapp values decrease linearly as the protein net charge increases (Table SI-1, R² ≥ 0.99 for all of the nine proteins) [4, 9, 40]. As expected, the slope of the line depends on the specific protein. Most notably, the intersection of the GBapp fitted line with the line of solvent GB corresponds with remarkable agreement to the experimental main (most abundant) charge state under mild ESI conditions [7, 10] (Figure 2 and Figure SI-7). In contrast, GBapp turns out to be systematically underestimated when the calculation is performed constraining the non-hydrogen atoms to their positions in the NMR or X-ray structures (Figure 2). By reproducing the experimental charge for proteins of different size, our calculations are also consistent with the well-known charge-to-mass empirical relation \( q \propto \sqrt {M} \) [10, 41].
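The charge prediction described above amounts to fitting a line to GBapp versus q and solving for the charge at which it crosses the solvent GB. The sketch below illustrates this with a plain least-squares fit; the function names are hypothetical, and the GBapp series and the solvent GB are synthetic numbers chosen for the example, not results of this study.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def crossing_charge(charges, gb_app, gb_solvent):
    """Charge at which the fitted GBapp line crosses the solvent GB."""
    slope, intercept = linear_fit(charges, gb_app)
    return (gb_solvent - intercept) / slope


# Synthetic data (kJ/mol), for illustration only: GBapp = 1100 - 50 q.
charges = [4, 5, 6, 7, 8]
gb_app = [900.0, 850.0, 800.0, 750.0, 700.0]
q_cross = crossing_charge(charges, gb_app, gb_solvent=725.0)
print(q_cross)
```

The crossing is generally non-integer; comparing it with the experimental main charge state then requires choosing a rounding convention, which is not specified here.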

Figure 2

Average GBapp (in kJ/mol) calculated for the most probable protomers of hen-egg white lysozyme. Black circles and orange circles represent, respectively, values calculated from the optimized and the non-optimized (pdb) structures. The black line and orange line are the result of a linear fitting (in both cases the correlation coefficient, R², is around 0.995). Standard deviation from the average is given as error bar (when not visible, the standard deviation is smaller than the symbol size). The data for the other proteins considered in the present study are reported in Fig. SI-7. The horizontal lines indicate the GB of various solvents: water (red line), isopropanol (blue line), ammonia (purple line), triethylammonium bicarbonate (cyan line), 1,5-diazabicyclo[4.3.0]-5-ene (green line). The experimental main (most intense) charge states observed from these solvents are reported with symbols colored accordingly. Comparison with the most intense charge state is made instead of maximum or average charge state because of its less ambiguous determination from literature data [39, 51]

It has been previously shown that such an empirical charge-to-mass relation reflects a linear log–log charge-to-surface relation, which has been proven to hold both for folded [41, 42] and unfolded [43] proteins. The relation holds when comparing the surface of hydrated proteins or protein complexes with different surface-to-mass relations [41]. Our results also support a linear log–log correlation between charge and desolvated protein surface. Indeed, a linear correlation seems to be respected when the area calculated from the gas-phase structures is employed (Figure SI-5). However, more data points will be needed to explore larger molecular weights.

Table 2 Average Structural Properties at 300 K in Water and in Vacuo for the Proteins of Table 1. From Left to Right: Radius of Gyration in Water (R_g, wat in Å); Radius of Gyration of the Backbone in Water (R_g, bb wat in Å); Radius of Gyration in Gas-Phase (R_g, gp in Å); Radius of Gyration of the Backbone in Gas-Phase (R_g, bb gp in Å); Protein–Protein Hydrogen Bonds in Water (HBwat); Protein–Protein Hydrogen Bonds in Gas-Phase (HBgp); Total Surface Area in Water (S_tot, wat in Å²); Total Surface Area in Gas-Phase (S_tot, gp in Å²); Hydrophobic Surface in Water (S_pho, wat in Å²); Hydrophobic Surface in Gas-Phase (S_pho, gp in Å²); Root Mean Square Displacement of the Backbone in Water (RMSDwat in Å); Root Mean Square Displacement of the Backbone in Gas-Phase (RMSDgp in Å). The Proteins in Vacuo are in the Charge State Predicted in the Present Study, whereas the Proteins in Water have all of the Acidic and Basic Residues Ionized but Histidines, which are Assumed Neutral. Standard Deviations are Reported in Parentheses

3.3 A Simple Model for Proteins in the Gas Phase

We now extend the correlation between charge and mass to any folded protein, by introducing a simple and general model energy function, without adjustable parameters. This model is based on a plausible assumption: the experimentally observed charge-to-mass relation can be interpreted as a linear combination of energetic contributions due to electrostatic repulsion and internal solvation. The latter is considered to be proportional to the protein surface and it is based on the above observation that the number of ionized residues and the number of hydrogen bonds they present per surface unit is constant.

We start by modeling a protein in the gas phase as a sphere of radius R and density ρ, with a net charge q uniformly distributed on its surface. The electrostatic energy of the protein can be expressed as:

$$ U(q) = \frac{1}{2}\frac{1}{{4{{\epsilon }_0}\pi }}\frac{{{{q}^2}}}{R} - 4\pi \xi {{R}^2}. $$
(1)

The quadratic dependence of the electrostatic energy on the total protein charge (first term) accounts for the linear change in GBapp described above (see also Figure SI-8). The second term takes into account the stabilization by intramolecular interactions. It is proportional to the surface area 4πR² via a parameter to be determined by fitting our computational results (ξ ≈ 0.994 N/m with ρ from Ref. [52]). The protein is unstable when U(q) ≥ 0. Hence, the maximum charge attainable is

$$ q = 4\pi \sqrt {{2{{\epsilon }_0}\xi {{R}^3}}} $$
(2)

or

$$ q = 2\sqrt {{\frac{{6\pi {{\epsilon }_0}\xi }}{\rho }}} \sqrt {M}, $$
(3)

where M = (4/3)πρR³.
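For completeness, the intermediate steps between Equations (1), (2), and (3) can be written out explicitly; the lines below use only the symbols already defined. Setting U(q) = 0 in Equation (1) gives

$$ \frac{{{q}^2}}{{8\pi {{\epsilon }_0}R}} = 4\pi \xi {{R}^2}\quad \Rightarrow \quad {{q}^2} = 32{{\pi }^2}{{\epsilon }_0}\xi {{R}^3}, $$

whose positive root is Equation (2). Substituting \( {{R}^3} = 3M/\left( {4\pi \rho } \right) \) then yields

$$ {{q}^2} = \frac{{24\pi {{\epsilon }_0}\xi }}{\rho }M, $$

which is Equation (3) after taking the square root.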

A numerical model based on a charged ellipsoid shows that, also in such a case, \( q \propto \sqrt {M} \) (see Supplementary Information). Thus, the \( q \propto \sqrt {M} \) relation is largely independent of the specific shape being used. More generally, it is easy to see from a simple dimensional analysis that any stabilizing contribution (−ξS) that depends on the surface area (S) yields a \( \sqrt {M} \) dependence of the maximum possible charge. Indeed, the electrostatic energy is dimensionally proportional to charge² × length⁻¹, whereas the stabilizing contribution is proportional to length². Thus, the square of the maximum charge is proportional to a volume, and consequently the charge must be proportional to the square root of the mass.
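The square-root scaling of the spherical model can be checked numerically with a short sketch of Equations (2) and (3) in SI units. The function names are hypothetical, and the values of ξ and ρ used in the usage lines are arbitrary placeholders chosen only to exercise the scaling, not the fitted parameters of this work.

```python
import math

EPS0 = 8.8541878128e-12     # vacuum permittivity, F/m
E_CHARGE = 1.602176634e-19  # elementary charge, C
DALTON = 1.66053906660e-27  # atomic mass unit, kg


def sphere_radius(mass_da, rho):
    """Radius (m) of a sphere of mass M and density rho, from M = (4/3) pi rho R^3."""
    return (3.0 * mass_da * DALTON / (4.0 * math.pi * rho)) ** (1.0 / 3.0)


def max_charge_from_radius(radius_m, xi):
    """Equation (2): q = 4 pi sqrt(2 eps0 xi R^3), in elementary charges."""
    return 4.0 * math.pi * math.sqrt(2.0 * EPS0 * xi * radius_m ** 3) / E_CHARGE


def max_charge_from_mass(mass_da, xi, rho):
    """Equation (3): q = 2 sqrt(6 pi eps0 xi / rho) sqrt(M), in elementary charges."""
    mass_kg = mass_da * DALTON
    return 2.0 * math.sqrt(6.0 * math.pi * EPS0 * xi / rho) * math.sqrt(mass_kg) / E_CHARGE


# Placeholder parameters (xi in N/m, rho in kg/m^3), for illustration only.
xi, rho = 0.1, 1000.0
q_small = max_charge_from_mass(10000.0, xi, rho)
q_large = max_charge_from_mass(40000.0, xi, rho)
print(q_large / q_small)  # quadrupling the mass doubles the maximum charge
```

Equations (2) and (3) are two forms of the same relation, so evaluating Equation (2) at the radius returned by `sphere_radius` reproduces the result of Equation (3).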

Substituting our fitted ξ value in Equation (3) yields, with remarkable accuracy, not only the main charge state displayed by the proteins considered in this study, but also the experimental q versus M curve for a large variety of other proteins (Figure 3). Notice that we compare our predicted maximum charges q with the reported experimental main charge. This approximation is justified by the fact that experimental main and maximum charges are very similar for folded proteins [10] and they are related to each other.

Figure 3

Experimental average charge state as a function of the protein mass [10]. The curves predicted by the model based on the Rayleigh charge [10] (blue line, reduced χ² = 51) and the spherical model introduced here (red line, reduced χ² = 0.25) are also shown

Our simple model, Equation (3), formally yields a result similar to that obtained with the Rayleigh-charge hypothesis. Both predict a \( \sqrt {M} \) dependence of the protein charge. However, in contrast to the Rayleigh-charge hypothesis, the present model does not involve adjustable parameters and does not introduce any dependence on the solvent surface tension [12]. The only parameter entering Equation (3) is the stabilization term due to internal solvation, and it is obtained from the quantities calculated for the set of proteins considered here. This model eliminates the explicit dependence on surface tension that derives from the application of the Rayleigh equation to the prediction of protein ionization by electrospray. At the same time, the simplifying assumptions of the model do not allow it to capture the indirect role that the solvent surface tension might play in protein ionization via its effect on the electrospray mechanism. Surface tension is the limiting factor for droplet charging during electrospray, as indicated by the Rayleigh equation, and conditions might be found in which it becomes the limiting factor also for protein ionization [1, 2].

It should also be underscored that the results of this study do not help to discriminate between the ion-evaporation model and the charged-residue model concerning the mechanism of production of gas-phase ions during ESI [1, 2]. The present model is not meant to test those hypotheses and is compatible with both mechanisms.

4 Conclusions

The present computational study establishes the relevance of GBapp for folded proteins under electrospray conditions. The calculations do not assume a role for the charge of the precursor ESI droplet. Our proposed model reproduces the well-known relationship between observed charge and protein mass based only on intrinsic protein features and solvent nature. Hence, the liquid medium of the precursor droplet provides or accepts protons according to its GB relative to the GBapp of the protein. Along with previous studies [2, 4], these results support a model in which the same principle (i.e., the GBapp of the protein relative to the GB of the solvent) is applied to folded and unfolded proteins, in order to explain the experimentally observed charge values.