Introduction

Acids and bases in solution bind and release protons in a process that changes their molecular charge distribution, influencing solubility, molecular recognition, and reactivity. Many biological and bio-active molecules have pKas in the physiological pH range and so exist as a mixture of protonation and tautomeric states. Knowing which protonation states are energetically accessible is important for the design of molecules with desired function [1, 2]. The binding or loss of a proton is one of the simplest reactions, so the calculation of the relative free energy of protonation states as a function of pH is a powerful test of biomolecular modeling methodologies [3, 4].

The recent SAMPL6 Blind pKa Prediction Challenge focused on the prediction of pKas for 24 small molecules [5]. As the participant submissions were analyzed and compared, it became apparent that the description of the free energy landscape of molecules with multiple protonation states is not simple and that a list of many pKa values is in fact not the best way to describe the behavior of the molecule as a function of pH. In an attempt to capture all necessary information regarding pKa predictions, the SAMPL6 pKa Challenge supported three different reporting schemes (submission types): microscopic pKa values (type I), fractional populations of microstates with respect to pH (type II), and macroscopic pKas (type III). These reporting schemes captured different aspects of the predictions; however, none of them proved to be optimal. Here we present a different reporting scheme that provides a complete and concise description of the thermodynamic behavior of molecules with multiple protonation and tautomeric states, and allows the derivation of pKas.

Proteins can have innumerable protonation and tautomeric microstates

The complexity of protonation equilibria of a molecule can vary enormously depending on the number of possible titratable groups and the interactions amongst them. Thus, a simple molecule with a single titratable group has a single well-defined pKa, which is the pH where the species with different numbers of protons have the same free energy and thus the same concentration. For a single protonatable group the reaction is:

$${\text{AH}} = {\text{A}}^{ - } + {\text{H}}^{ + }$$
(1a)

Defining the pKa as − log10Keq leads to:

$${\text{pK}}_{{\text{a}}} = {\text{pH}} - \log _{{10}} \frac{{{\text{A}}^{ - } }}{{{\text{AH}}}}$$
(1b)
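As a minimal illustration of Eq. 1b (a sketch, not part of the original analysis), the deprotonated fraction of a single titratable group follows directly from the pH and the pKa. The function name `fraction_deprotonated` and the example pKa of 5.0 are illustrative assumptions.

```python
def fraction_deprotonated(pH: float, pKa: float) -> float:
    """Fraction of molecules in the A- form for a single titratable group.

    Rearranging Eq. 1b gives [A-]/[AH] = 10**(pH - pKa), so the
    fraction of A- is ratio / (1 + ratio).
    """
    ratio = 10.0 ** (pH - pKa)
    return ratio / (1.0 + ratio)


# Hypothetical group with pKa = 5.0 (illustrative value only)
for pH in (3.0, 5.0, 7.0):
    print(f"pH {pH:4.1f}: fraction A- = {fraction_deprotonated(pH, 5.0):.3f}")
```

At pH equal to the pKa the two species have equal concentration, which is the definition used in the text.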

We will define a protonation macrostate by the total charge of the molecule, while the microstate defines the specific protonation and tautomeric state of all protonatable sites. As the number of protonatable sites in a molecule increases, the number of possible microstates the molecule can access increases. Proteins and other (bio)polymer polyelectrolytes [6] can have many acidic and basic substituents. If we only consider protonatable sites that can gain or lose a single proton (A vs. AH or B vs. BH+) then there are 2^n different distributions of protonation states for n protonatable groups. On average, 25% of protein residues are either Asp, Glu, Lys, or Arg [7], providing a very large number of possible microstates. Long-range electrostatic interactions make the ionization of all residues interdependent [8]. Computational tools have been developed that view the protein environment as perturbing the pKas that individual residues would have in solution [9,10,11]. Given the huge number of possible microstates, Metropolis Monte Carlo sampling is typically used to sample the Boltzmann distribution of protonation states at each pH [11]. Thus, proteins have many protonatable sites whose interactions are not negligible, but can be treated as a sum of separable, individual interactions. Typically, calculations consider a relatively rigid protein and calculate the electrostatic interactions with the Poisson-Boltzmann equation of continuum electrostatics. Evolving methods allow the protein to move via classical MD simulations, with the protonation change sampled either by a separate MC analysis or by lambda dynamics within the MD trajectory [12,13,14,15]. However, these methods will not work to define the pH dependent behavior of small molecules with multiple protonation states.
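The growth in the number of protonation patterns with the number of two-state sites can be made concrete with a short sketch. The function name `enumerate_protonation_states` and the 100-residue protein are hypothetical, chosen only to illustrate the counting argument.

```python
from itertools import product


def enumerate_protonation_states(n_sites: int):
    """Return an iterator over every protonation pattern for n two-state sites.

    Each pattern is a tuple of 0 (deprotonated) or 1 (protonated),
    so there are 2**n_sites patterns in total.
    """
    return product((0, 1), repeat=n_sites)


states = list(enumerate_protonation_states(3))
print(len(states), "patterns for 3 sites")

# Hypothetical 100-residue protein in which 25% of residues are titratable:
# far too many patterns to enumerate one by one, hence Monte Carlo sampling.
print(2 ** 25, "patterns for 25 sites")
```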

SAMPL6 pKa challenge targets are small molecules with multiple protonation and tautomeric microstates

Organic molecules with multiple protonatable sites play roles in metabolism and are design targets for drugs [1, 2]. The pH dependent behavior of these molecules is at a level of complexity between a single protonatable group (Eq. 1) and a complex polyelectrolyte such as a protein. The SAMPL6 pKa Challenge chose 24 extended organic compounds as a test case for a blind prediction of molecule pKas [5, 16] (https://github.com/samplchallenges/SAMPL6/tree/master/physical_properties/pKa/microstates). All SAMPL6 molecules have multiple potential protonation macrostates (states with different total charge) as well as different energetically accessible tautomer states (same total charge but different protonation site), each of which represents a defined microstate. There are three to six protonatable sites in each molecule, substantially fewer than in proteins, but still sufficient to generate tens of protonation and tautomeric microstates (Fig. 1) [4, 5, 17]. However, the coupling amongst the protonatable sites in these molecules does not allow separation into independent units with simply summable interactions. Rather, each molecule must be treated as a whole, with its microstate energy a function of the proton distribution and molecular conformation. The prediction methods used in submissions to the SAMPL6 pKa Challenge range from knowledge-based empirical methods to detailed quantum mechanical simulations. Several recent papers have described some of the results [16, 18,19,20,21,22,23] and reference [17] provides an analysis of the predictions submitted for all molecules.

Fig. 1

Network of protonation and tautomeric microstates for SAMPL6 pKa challenge target SM07

Figure 1 shows the network of 11 considered protonation and tautomeric microstates of the SAMPL6 pKa Challenge molecule SM07, which will be described in detail here. While it is theoretically possible to enumerate additional microstates, this set combines microstates suggested by SAMPL6 pKa Challenge organizers (using Epik from the Schrödinger Suite v3.4 and QUACPAC from the OpenEye Toolkit v2017.Feb.1) plus any additional microstates included in SAMPL6 challenge submissions. The protons of interest are denoted by blue balls on top of the nitrogen that is the proton acceptor. Each column has a different number of dissociable protons, indicated at the top. All microstates in a column are tautomers with the same total charge; their vertical order is arbitrary. All the tautomers in a column contribute to one of the macrostates, which have from 4 (4H) to zero (0H) dissociable protons. Black double-headed arrows indicate pKas that were reported in the SAMPL6 Challenge; red arrows are the transitions between tautomers. Some transitions, such as that between 13 and 7, require tautomer changes. Likewise, transitions between tautomers in the top and bottom rows (e.g. between microstates 2 and 3) are also well-defined, but are not shown here for clarity. The numbers associated with each microstate simplify the microstate IDs assigned by the SAMPL6 pKa Challenge, which have the form SM07_microXXX, where XXX are three digits. For example, microstate 4 in this figure corresponds to SM07_micro004.

Figure 1 shows many closed reaction cycles. One example starts with state 4 (1H); the shift of a proton position leads to tautomeric microstate 2 (1H); the loss of a proton generates microstate 12 (0H); and proton binding regenerates microstate 4. In a thermodynamically consistent network the summed change in free energy for these three reactions should equal zero. The network contains many cycles with 3 microstates as well as larger ones, such as those that connect microstates 12 (0H) and 16 (4H) through different tautomers of the intermediate protonation states. Thermodynamic consistency can provide a test of a set of calculations for a given molecule.
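The cycle-closure condition can be sketched in a few lines. The free energy values below are hypothetical (they are not taken from any SAMPL6 submission), and the tolerance of 0.05 is an assumed allowance for rounding of reported values.

```python
def cycle_sum(steps):
    """Sum the free energy changes along a closed path of reactions."""
    return sum(steps)


def is_closed(steps, tolerance=0.05):
    """True if the summed free energy around the cycle is zero within tolerance."""
    return abs(cycle_sum(steps)) <= tolerance


# Hypothetical free energy changes at one fixed pH for the cycle
# 4 -> 2 (tautomer shift), 2 -> 12 (proton loss), 12 -> 4 (proton binding).
consistent = [1.20, 3.50, -4.70]
inconsistent = [1.20, 3.50, -4.10]

print("consistent set closes:", is_closed(consistent))
print("inconsistent set closes:", is_closed(inconsistent))
```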

What information was requested for the SAMPL6 pKa challenge?

Three types of submissions for predictions were requested: microscopic pKas for related microstate pairs (type I), fractional microstate populations in the pH interval 2 to 12 (type II), and macroscopic pKas (type III). We will see that type I and type II are formally identical as long as the same microstates are included in both descriptions and that type III misses important information needed to see if a numerical pKa that matches an experimental value captures the correct protonation and tautomeric states.

Macroscopic pKa entries were reported (type III submissions). Macroscopic states combine all the (tautomer) microstates that have the same number of protons and thus the same net charge. Macroscopic pKas are closest to experiments that monitor the proton uptake of the molecule as a function of pH, such as electrochemical titrations or spectrophotometric titrations [5]. However, the macroscopic predictions did not require an assignment of which microstates are involved or even how many protons are associated with the beginning and end state. Thus, it may not be clear if the transition is, for example, between A and AH or between AH and AH2+. Unfortunately, this ambiguity is also a problem for many of the experimental measurements of these complex molecules. Thus, a spectroscopic or potentiometric titration provides information on the pKa value of a proton binding event without knowing the macrostate (charge state) or microstate(s) (tautomer(s) with that charge) that are connected by the pH-dependent transition. This lack of specificity made it difficult to determine if different methods predicting a similar pKa were referring to transitions between even the same macrostates of the molecule [5].

The minimum information needed to describe the network of protonation and tautomeric states: microstate ∆G°s at one pH provides one number to rule them all

The complexity of the SAMPL6 protonation and tautomer microstates for each molecule led to an unanticipated open question: What is the best way to report the predictions? The ideal description should provide the minimum information needed so the free energy landscape for all the protonation and tautomeric microstates in a network, such as that shown in Fig. 1, can be described at each pH. It should allow easy comparison with experimental measurements and between multiple predictions for the same system. It should define the distribution of tautomeric microstates for each protonation macrostate and include information about high energy microstates, which could become important in a specific binding pocket or in a reaction mechanism. It should make it possible to check for thermodynamic consistency of cycles of changes in protonation and tautomer states such as those seen in Fig. 1.

SAMPL6 pKa Challenge type I submissions report the predicted pKas for transitions between selected pairs of prespecified microstates. This information is far richer than a list of macroscopic pKas. However, as the lists of individual microscopic pKas were analyzed it became apparent that this format is also far from ideal. A list of pKas provides only a local view of the relative proton affinity of pairs of states of the molecule. In addition, the list can have more information than is needed. For example, for SM07 (Fig. 1) there are 24 possible pKas between all pairs of microstates that vary by one proton (adjacent rows), of which 17 are shown. However, we will show that knowing the relative free energy of the 11 individual microstates at a single reference pH (∆G°) can completely describe the free energy (∆G) of all microstates at any pH. Populations of each microstate can then be obtained from the ∆Gs to recover titration curves. In addition, it is not readily apparent if the overall free energy landscape built up from the network of individual pairwise pKas is thermodynamically consistent, while this is straightforward to see when each microstate of the molecule is associated with its relative free energy.

Deriving ∆G°s for a network of protonation and tautomeric states from a list of pKas

Describing a molecule with 3 pKas and no tautomeric states by the relative microstate free energy at pH 0 or 7

Currently, the information we have about the protonation and tautomeric states of protonatable molecules is often collected as a set of pKas and this was the information submitted to SAMPL6. We will therefore show how to use these pKas to build up a standard state free energy ladder, which is the set of free energy differences between microstates at one pH. The first example considers three pKas separating four states, denoted A, B, C, and D (Table 1). A has three dissociable protons and D none. The analysis would be the same if the input pKas came from experiment or simulation. Tautomeric microstates, with the same number of protons but at different locations on the molecule, will be considered in the next section. The pKas are − 2.17, 5.61, and 13.77. These are taken from the Epik predictions for the pKas between the SM07 microstates 14, 7, 4, and 12 (Figs. 1, 2) [24].

Table 1 The pKas and number of protons provide the input information needed to describe a molecule with 4 protonation states separated by 3 pKas
Fig. 2

Relative standard state free energies (∆G°) and relative numbers of protons (∆m) are all that is needed to completely describe the pH dependence of a system with multiple protonation states. a The relative free energies of the states A, B, C, and D as a function of pH given the pKas in Table 1 transformed to standard state free energies, ∆G° (Table 2). Relative free energies of microstates change linearly with respect to pH (Eq. 2b). Squares are input pKas, which can be experimentally observable. Circles mark ∆G°jB, the free energies at pH 0 of the other three states relative to B. Horizontal arrows at the bottom show the state at lowest free energy (dominant population) in each pH range. b States A, B, C and D with state C as the reference. c Titration showing relative state populations vs pH predicted using ∆G°s in Table 2 and Eqs. 3 and 4. This plot is the same independent of which state is used as the reference. ∆G is given in unitless free energies where a unit change in ∆G yields a tenfold population change

To determine the relative free energy of the four microstates at a single pH one state is chosen to be the reference state. The reference state and pH are arbitrary choices, but simply need to be applied consistently. We will describe the calculation of the relative state free energies with B as the reference (∆G°jB) and then show that using C as a reference (∆G°jC) provides the same relative energies between all states, but with a constant offset equal to the energy difference between microstates B and C (∆G°BC). The reference state is defined here as the second term in the subscript.

If B is the reference then its energy is zero and independent of pH, as shown by the horizontal black line at ∆G = 0 in Fig. 2a. The pKAB at − 2.17 gives the pH where state A and B have equal energy so the line describing the pH dependence of the relative free energy of A crosses B here. The free energy difference between A and B at any other pH is:

$$\Delta {\text{G}}_{{{\text{AB}}}}^{{{\text{pH}}}} = \Delta {\text{m}}_{{{\text{AB}}}} {\text{C}}_{{{\text{units}}}} ({\text{pH}} - {\text{pK}}_{{{\text{AB}}}} )$$
(2a)

∆GAB is 0 when the pH is equal to the pKAB. Cunits converts the values into the desired units of energy: it is 1.36 for kcal/mol or 5.69 for kJ/mol, the value of RT ln 10 near room temperature. We will use Cunits = 1, so that energies are expressed in units of RT ln 10, which can be referred to as pH units. Thus, one unit of energy changes the equilibrium constant by a factor of 10 at the reference temperature. A change in pH of 1 unit leads to a 1 unit change in ∆GAB if a proton is gained or − 1 unit if a proton is lost. As A has one more proton than B (∆mAB = 1) its energy relative to B increases with pH as the proton concentration decreases. The standard state reference energy for A, which is its energy relative to the reference state B at the reference pH of 0, is:

$$\Delta {\text{G}}_{{{\text{AB}}}} ^{ \circ } = \Delta {\text{m}}_{{{\text{AB}}}} {\text{C}}_{{{\text{units}}}} ( - {\text{pK}}_{{{\text{AB}}}} {\text{)}}$$
(2b)

∆G°AB is thus the y-intercept in Fig. 2a. In biochemistry, the standard state (∆G°’) is often defined at pH 7 not pH 0. ∆G°’ is provided in Table 2, and can be read off Fig. 2a or b from the y value for each state at pH 7. At pH 7 the relative free energies are:

$$\Delta {\text{G}}_{{{\text{AB}}}} ^{{ \circ ^{\prime}}} = \Delta {\text{G}}_{{{\text{AB}}}}^{7} = \Delta {\text{m}}_{{{\text{AB}}}} {\text{C}}_{{{\text{units}}}} (7 - {\text{pK}}_{{{\text{AB}}}} )$$
(2c)

Table 2 State energies derived from pKas in Table 1 using different reference states or reference pHs

The pKa of 5.61 connects state C to the reference state B. The pKBC is marked on Fig. 2a as the point where the two states have equal energy. As C has one fewer proton than B (∆mCB =  − 1) the free energy of C decreases relative to B with increasing pH, so its energy has a slope of − 1 (in pH units). Extrapolation of the free energy as a function of pH back to the reference pH of 0 yields the ∆G°CB of 5.61 (Eq. 2b).

While the pKas in Table 1 give the pairwise free energy difference between states A or C and B, there is no direct information about the transition between states B and D, which differ by 2 protons. The pKCD at 13.77 connects states C and D. Thus, ∆G°DC = ∆mDC(− pKDC) = 13.77, and ∆G°DB = ∆G°DC + ∆G°CB (i.e. the free energy change from B to C plus that from C to D) (Table 2). The slope of ∆GDB is − 2 as D has 2 fewer protons than B. The slope of ∆GDC with pH is − 1, which is not easy to see from the graph, but will become apparent in the next section when C is used as the reference state. At each pH the predominant species will be the state at lowest energy, shown by thicker lines in Fig. 2a. Thus, below pH − 2.17 this is state A; between − 2.17 and 5.61 it is state B; between 5.61 and 13.77 it is state C; and above pH 13.77 D is at the lowest energy and is thus the predominant species.
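The construction of the free energy ladder from the three pKas can be sketched as follows. This is a minimal illustration under the conventions of Eq. 2b with Cunits = 1; the function name `standard_state_ladder` and the data layout are assumptions made for this example, and the sketch assumes every state is connected to the reference by some chain of pKas.

```python
# pKas from the four-state example in the text; each entry is
# (state with more protons, state with fewer protons, pKa).
PKAS = [("A", "B", -2.17), ("B", "C", 5.61), ("C", "D", 13.77)]
C_UNITS = 1.0  # energies expressed in pH units


def standard_state_ladder(pkas, reference):
    """Return {state: dG0 relative to `reference` at pH 0}.

    Walks outward from the reference. Each step applies Eq. 2b,
    dG0 = dm * C_units * (-pKa), where dm = +1 when the new state has
    one more proton than the known state and dm = -1 when it has one fewer.
    """
    dg0 = {reference: 0.0}
    pending = list(pkas)
    while pending:
        remaining = []
        for more_h, fewer_h, pka in pending:
            if more_h in dg0 and fewer_h not in dg0:
                dg0[fewer_h] = dg0[more_h] + C_UNITS * pka  # dm = -1
            elif fewer_h in dg0 and more_h not in dg0:
                dg0[more_h] = dg0[fewer_h] - C_UNITS * pka  # dm = +1
            elif more_h not in dg0:
                remaining.append((more_h, fewer_h, pka))  # not reachable yet
        if len(remaining) == len(pending):
            raise ValueError("some pKas are not connected to the reference")
        pending = remaining
    return dg0


ladder_B = standard_state_ladder(PKAS, "B")
for state in "ABCD":
    print(f"dG0({state},B) = {ladder_B[state]:.2f}")
```

With B as the reference this reproduces the values described in the text: the energy of D is the sum of the B to C and C to D steps.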

Translating the pKas, which each connect a pair of microstates, into relative ∆G° for the ensemble of four microstates provides additional information. Thus, the free energy difference between states not connected by a defined pKa, such as a two proton transition between A and D, can be obtained from the sum of stepwise ∆Gs at any pH. The crossing points between any pair of lines on the free energy vs pH plot show the pH where two microstates have equal energy and thus equal probability.

The selection of the reference state is arbitrary. Figure 2b shows the graphical analysis of the same pKas shown in Table 1 but with state C as the reference, instead of B. Now C lies along the horizontal at ∆G = 0. B has one more proton than C so the pH dependence of ∆GBC has a slope of 1 and ∆G = 0 at pKBC. State D has one fewer proton than C so ∆GDC changes with pH with a slope of − 1. ∆GDC is 0 at pH 13.77, at pKDC. Now it is the pKAB at − 2.17 that is not directly connected to the reference state. ∆G°AC is ∆G°AB + ∆G°BC (Tables 1, 2). The two graphs in Fig. 2a and b are the same except for a rotation to move from B being on the x axis to place C on this axis. For any microstate (j) ∆G°jC and ∆G°jB differ by the difference in energy between the states B and C (∆G°BC), which is − 5.61 at pH 0. As the relative energy differences between all states are the same at each pH, the lowest energy (and hence highest population) state at each pH is the same in Fig. 2a and b.
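Changing the reference state amounts to subtracting a constant from every energy. A minimal sketch, using the pH 0 energies relative to B that follow from the pKas of − 2.17, 5.61 and 13.77 via Eq. 2b (the function name `rereference` is an assumption for this example):

```python
def rereference(dg0, new_reference):
    """Shift all energies so that `new_reference` sits at zero."""
    offset = dg0[new_reference]
    return {state: energy - offset for state, energy in dg0.items()}


# dG0 relative to B at pH 0 (Cunits = 1)
dg0_vs_B = {"A": 2.17, "B": 0.0, "C": 5.61, "D": 19.38}
dg0_vs_C = rereference(dg0_vs_B, "C")

for state in "ABCD":
    print(f"{state}: vs B = {dg0_vs_B[state]:6.2f}   vs C = {dg0_vs_C[state]:6.2f}")
```

Because only a constant offset changes, every pairwise difference, and therefore every pKa and population, is unaffected by the choice of reference.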

Given the relative energy at pH 0, the relative energy of each state can be determined at any pH by:

$$\Delta {\text{G}}_{{{\text{jB}}}}^{{{\text{pH}}}} = \Delta {\text{G}}_{{{\text{jB}}}} ^{ \circ } + \Delta {\text{m}}_{{{\text{jB}}}} {\text{C}}_{{{\text{units}}}} \left( {{\text{pH}} - {\text{pH}}_{{{\text{ref}}}} } \right)$$
(3)
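Eq. 3 translates directly into code. The sketch below evaluates the four-state example at pH 7 using the pH 0 energies relative to B; the function name `free_energy_at_ph` and the dictionary layout are illustrative assumptions.

```python
def free_energy_at_ph(dg0, dm, ph, ph_ref=0.0, c_units=1.0):
    """Eq. 3: dG(pH) = dG0 + dm * C_units * (pH - pH_ref)."""
    return dg0 + dm * c_units * (ph - ph_ref)


# name -> (dG0 relative to B at pH 0, number of protons relative to B)
STATES = {"A": (2.17, +1), "B": (0.0, 0), "C": (5.61, -1), "D": (19.38, -2)}

at_ph7 = {
    name: free_energy_at_ph(dg0, dm, 7.0) for name, (dg0, dm) in STATES.items()
}
for name, energy in at_ph7.items():
    print(f"dG({name},B) at pH 7 = {energy:.2f}")
```

At pH 7 state C lies below the reference B, consistent with C being the predominant species between pH 5.61 and 13.77.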

Given the energy as a function of pH (∆GpHjB) the fraction of each state, Nj, at each pH is obtained from the standard expression:

$${\text{N}}_{{\text{j}}}^{{{\text{pH}}}} = \frac{{10^{{ - \Delta {\text{G}}_{{{\text{jB}}}}^{{{\text{pH}}}} }} }}{{\sum _{{\text{i}}} 10^{{ - \Delta {\text{G}}_{{{\text{iB}}}}^{{{\text{pH}}}} }} }}$$
(4)

Plotting NpHj vs. pH provides the titration curve. The crossing points of titration curves recover the initial, input pKas (Fig. 2c).
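A titration curve can be sketched by combining Eqs. 3 and 4. The example below uses the four-state ladder relative to B with Cunits = 1; the function name `populations` is an assumption, and subtracting the lowest energy before exponentiating is a numerical convenience added here that does not change the result.

```python
# name -> (dG0 relative to B at pH 0, number of protons relative to B)
STATES = {"A": (2.17, +1), "B": (0.0, 0), "C": (5.61, -1), "D": (19.38, -2)}


def populations(states, ph, ph_ref=0.0):
    """Eq. 4 with Cunits = 1: N_j = 10**(-dG_j) / sum_i 10**(-dG_i)."""
    energies = {
        name: dg0 + dm * (ph - ph_ref) for name, (dg0, dm) in states.items()
    }
    lowest = min(energies.values())  # shift to avoid overflow at extreme pH
    weights = {name: 10.0 ** (-(e - lowest)) for name, e in energies.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


for ph in (0.0, 5.61, 10.0):
    fractions = populations(STATES, ph)
    summary = ", ".join(f"{name}={frac:.3f}" for name, frac in fractions.items())
    print(f"pH {ph:5.2f}: {summary}")
```

Evaluating the function on a grid of pH values and plotting each fraction gives the curves of Fig. 2c; at pH 5.61 the populations of B and C are equal, recovering the input pKa.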

Microstate analysis of SM07

In the microscopic analysis of SAMPL6 pKa challenge target SM07, eleven microstates were enumerated (Fig. 1). There were 32 blind submissions of microscopic pKa predictions from eight laboratories [16, 21, 24]. Four research groups submitted a single set of predictions, two submitted 2 distinct sets of predictions, another submitted 10 [19, 25], and a final group 14 [23]. As few as a single pKa was submitted for this compound (one prediction set) and as many as 17 pKas were reported (24 prediction sets). We will show how converting all the pairwise pKas to state ∆G°s will make it easier to compare the entire free energy landscape predicted by different methods, recognizing thermodynamic inconsistencies and ending with better appreciation of whether different calculation methods are converging to similar answers for states that are not experimentally accessible.

Choosing the reference state

We will first consider the values calculated with the program Epik [26,27,28]. From independent Epik calculations run with the -pH option at pH values between 2–12 (0.1 pH units apart), 8 microstates were predicted to be populated (Fig. 3b; Table 3) [24]. We will then describe the relative microstate energies obtained from the pKas in all submissions, which describe as many as 11 microstates (Fig. 1).

Fig. 3

Graphical depiction of the microstate energy as a function of pH and resultant titration curve for the eight microstates of SM07 described in Table 3. a Graphical representation of the 8 microstate energies as a function of pH using ∆G°s and ∆ms from Table 3. The squares show pKas that would be seen experimentally, as they connect the states that are at low energy at that pH, and the triangles show pKas that were the input to the calculation. There is an inconsistency in the relative energy of states 2 and 3 calculated when state 12 or state 6 or 7 is used to obtain the free energy difference from reference state 4 (Table 3). b The microstate network of 8 microstates of SM07 connected by pKas calculated with Epik [24]. Microstates predicted by Epik are a subset of those shown in Fig. 1. Dark blue arrows are the ∆G° between the two microstates; red arrows are ∆G° between tautomers. The standard deviations for ∆G°2,4 and ∆G°3,4 represent the standard error for the free energy calculated around the two nearest closed triangular loops. Green numbers under microstate identifiers are ∆G°, the free energy relative to state 4 at pH = 0. c The probability of each state as a function of pH. Note that while state 6 is the predominant microstate between pH − 5 and 5, a small amount of tautomer microstate 7 is seen. ∆G represents unitless free energies where a unit change in ∆G yields a tenfold population change. Python scripts and interactive Jupyter notebooks to generate networks and graphs of the relative free energy as a function of pH from a list of microscopic pKas can be found at https://github.com/choderalab/titrato

Table 3 (a) Epik microscopic pKas for SM07. (b) Epik SM07 microstate standard state free energies

Knowing the structure of each state we can count the number of protons (Fig. 1). As shown above, the choice of reference state is arbitrary. One choice, which is easier to automate, is to use microstate 12 or 16, which have the fewest or most protons (Fig. 1). However, we chose state 4 as the reference as it has four reported pKas. This allows the ∆G°j4 for microstates 6, 7, 11 and 12 to be determined directly with Eq. 2b (Table 3). ∆G°14,4 is then the sum of ∆G°14,7 + ∆G°7,4.

Determining the ∆G° between tautomers uncovers a lack of thermodynamic consistency

Microstates 2, 3, and 4 are tautomers with the same number of protons. Thus, their relative energy is independent of pH, so there is no pH where their energy is equal and thus no pKa can be defined. However, examination of Fig. 3b shows the ∆G° can be defined between these states by the summed energy along any path to the reference. For example, to determine ∆G°2,4 we consider two short paths: one with state 6 as the intermediate, and one via state 12. The two paths give a different ∆G°2,4 (Table 3). This indicates that the closed reaction cycle from microstate 4 to 6 to 2 to 12 and back to 4 does not sum to zero as it should for a thermodynamically consistent method. Such closed loops of protonation and tautomer changes become visible when the molecule is described as a network of protonation and tautomeric microstates with energies defined against a single reference state. When different cycles do not sum to zero there are multiple choices for the derived ∆G° values, from using one cycle, to averaging the 2 shortest cycles as carried out here, to averaging the results from all possible cycles. When the thermodynamic cycles close properly there is no ambiguity in the relative free energy of the microstates.
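The two-path estimate of a tautomer energy can be sketched as follows. The pKa values are hypothetical placeholders (they are not the Epik values of Table 3), and the helper name `step` is an assumption; each one-proton step uses Eq. 2b with Cunits = 1.

```python
from statistics import mean


def step(pka, dm):
    """Eq. 2b with Cunits = 1: dG0 of a one-proton step (dm = +1 gain, -1 loss)."""
    return dm * (-pka)


# Hypothetical microscopic pKas, for illustration only
pk_6_4 = 5.6    # 6 (2H) <-> 4 (1H)
pk_6_2 = 6.9    # 6 (2H) <-> 2 (1H)
pk_4_12 = 13.8  # 4 (1H) <-> 12 (0H)
pk_2_12 = 12.1  # 2 (1H) <-> 12 (0H)

# Path 4 -> 6 -> 2: gain a proton, then lose one from a different site
via_6 = step(pk_6_4, +1) + step(pk_6_2, -1)
# Path 4 -> 12 -> 2: lose a proton, then regain one at a different site
via_12 = step(pk_4_12, -1) + step(pk_2_12, +1)

dg0_2_4 = mean([via_6, via_12])  # average of the two shortest paths
cycle_error = via_6 - via_12     # zero for a thermodynamically consistent set

print(f"via 6: {via_6:.2f}  via 12: {via_12:.2f}")
print(f"averaged dG0(2,4): {dg0_2_4:.2f}  cycle error: {cycle_error:.2f}")
```

The difference between the two path sums equals the summed free energy around the cycle 4 to 6 to 2 to 12 and back to 4, so a nonzero value flags the inconsistency directly.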

Graphical analysis of the microstate energy as a function of pH

Figure 3a provides a graphical picture of the microstate energy as a function of pH obtained from the pKas in Table 3. Plotting the relative microstate free energy vs. pH shows that the lines within each of the two groups of tautomers (1H macrostate, microstates 2, 3, 4; and 2H macrostate, microstates 6, 7, 11) are parallel to each other, as the ∆G between them is independent of pH. The graph shows which state(s) are at experimentally accessible energy in any pH range. With increasing pH this is microstate 14, then a mixture of 6 and 7, then microstate 4, and at high pH microstate 12 predominates.

The nine pKas that are given each represent crossing points between two lines on the graph. There are many other possible pKas that can be read off the graph or obtained by determining the pH where two microstates with different numbers of protons have the same energy. For example, pK7,14 is given but pK6,14 might be more important as microstate 6 is at lower energy than the tautomer microstate 7. In addition, inconsistencies in the energies obtained for different pairs of pKas are seen. Thus, the intersection of states 2 and 6 as well as that of 3 and 7 differ from the reported pKas because the y intercept (∆G°) represents the average of two thermodynamic cycles while the slope of the lines for 2 and 3 must be fixed at 0, as these microstates have the same number of protons as reference state 4, or at 1 if the microstate has an additional proton (e.g. for 6, 7, and 11). The derived titration curve highlights the low energy, experimentally accessible states. It should be noted that the titration covers a pH range that extends higher and lower than most experiments. The 2H protonation state is found to be the most stable between pH − 5 and 5. While tautomer 6 is dominant, the free energy analysis shows tautomer 7 is close in energy so would be predicted to be a minority species. The relative probability of microstates 6 and 7 is pH independent and their sum gives the probability of the 2H macrostate as a function of pH.
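Any pKa, whether or not it was reported, can be computed as the crossing point of two lines on the free energy vs. pH graph. The sketch below applies this to the four-state example (states A to D, energies relative to B at pH 0, Cunits = 1); the function name `crossing_ph` is an assumption.

```python
def crossing_ph(dg0_i, dm_i, dg0_j, dm_j, ph_ref=0.0):
    """pH where states i and j have equal free energy (Cunits = 1).

    Sets dG0_i + dm_i*(pH - pH_ref) equal to dG0_j + dm_j*(pH - pH_ref)
    and solves for pH. Tautomers (dm_i == dm_j) have parallel lines that
    never cross, so no pKa exists and None is returned.
    """
    if dm_i == dm_j:
        return None
    return ph_ref + (dg0_j - dg0_i) / (dm_i - dm_j)


# name -> (dG0 relative to B at pH 0, number of protons relative to B)
STATES = {"A": (2.17, +1), "B": (0.0, 0), "C": (5.61, -1), "D": (19.38, -2)}

pk_AB = crossing_ph(*STATES["A"], *STATES["B"])  # an input pKa
pk_BD = crossing_ph(*STATES["B"], *STATES["D"])  # two-proton transition, not an input
print(f"pK(A,B) = {pk_AB:.2f}")
print(f"pK(B,D) = {pk_BD:.2f}")
```

The second value is the pH at which B and D would have equal energy, a two-proton crossing that is not among the input pKas but follows from the ladder.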

ECRISM-13 (SAMPL6 pKa challenge submission ID 0xi4b) reported 17 pKa values for SM07 (Fig. 4) [23]. The network obtained from these pKas shows that all closed paths around the network have a summed ∆G° of zero, indicating the reported values are thermodynamically consistent (Fig. 4a). Now there are predictions for the free energy of tautomers 13, 14 and 15 as well as the highly protonated microstate 16. The pattern of relative microstate energies is in qualitative agreement with the Epik simulations and the resulting titration is also similar. The same microstates are at low energy (Fig. 4c). Both calculations place the 2H microstates 6 and 7 close enough in energy that a mixture of tautomers is predicted to be seen (Figs. 3c, 4c).

Fig. 4

Graphical analysis of a network of 11 microstates of SM07. a ∆Gj4 as a function of pH for all microstates. ∆G°j4 is ∆Gj4 at pH 0. ∆G represents unitless free energies where a unit change in ∆G yields a tenfold population change. b Network of 11 SM07 microstates considered by calculation ECRISM-13 (ID 0xi4b) [23], the resultant pairwise ∆G°j4 (blue arrows) and microstate ∆G°j4 relative to microstate 4 (green numbers). c Predicted titration curves given the microstate ∆Gs as a function of pH. It should be noted that the pKas shown in Fig. 4c match the crossing points in Fig. 4a that occur as one lowest energy state is replaced by another as the pH changes

Overview of all submitted predictions for SM07: Do different calculation methods give similar values for ∆G°?

Table 4 gives the ∆G° for all predictions of all microstates of SM07 with microstate 4 as the reference state and pH 0 as the reference pH. It is far more compact than a table of pKas, requiring only a single value for each microstate. In contrast, a single microstate in the SM07 network of states described in Fig. 1 can be connected by as many as six pKas to the six microstates that differ by one proton. In addition, the pH independent ∆G° between tautomeric states is established, although there is no pKa that can be defined for a pair of microstates with the same number of protons. As shown in Figs. 2, 3 and 4, knowledge of ∆G°j4 at a single pH and the difference in the number of protons from the reference state (∆mj4) can be used to find the microstate free energy differences at all pHs. The analysis shows where any two microstates are at the same energy, either graphically or by calculation, and so identifies all possible microscopic pKas. The pKas can connect microstates at high energy or that differ by more than one proton. Knowing the microstate energies as a function of pH allows the calculation of their relative probability with pH. Plotting this probability as a function of pH generates a titration curve, visually identifying the low energy states and providing the macroscopic pKas (Figs. 2, 3, 4). Table 4 gives the ∆G°s at the reference pH of 0, although the table can be modified for any pH using Eq. 3 (e.g. Tables 2, 3).
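Macroscopic pKas follow from the microstate ∆G°s by pooling the tautomers of each macrostate. The sketch below is one way to do this under the conventions used here (Cunits = 1, reference pH 0): each macrostate is assigned an effective ∆G° from the summed Boltzmann weights of its tautomers, and the macroscopic pKa is the pH at which two adjacent macrostates have equal total population. The function names and the example energies are hypothetical.

```python
import math
from collections import defaultdict


def macrostate_energies(microstates):
    """Collapse tautomers into macrostates.

    `microstates` maps name -> (dG0 at pH 0, proton count relative to reference).
    Returns {proton count: effective dG0}, where the effective value is
    -log10 of the summed weights 10**(-dG0) of the tautomers.
    """
    weights = defaultdict(float)
    for dg0, dm in microstates.values():
        weights[dm] += 10.0 ** (-dg0)
    return {dm: -math.log10(w) for dm, w in weights.items()}


def macroscopic_pkas(microstates):
    """pKa between each pair of macrostates that differ by one proton."""
    macro = macrostate_energies(microstates)
    return {
        (dm, dm - 1): macro[dm - 1] - macro[dm]
        for dm in macro
        if dm - 1 in macro
    }


# Hypothetical energies: one reference state and two equal-energy 2H tautomers
example = {"ref": (0.0, 0), "taut_a": (-5.0, +1), "taut_b": (-5.0, +1)}
print(macroscopic_pkas(example))
```

In this example each tautomer alone would have a microscopic pKa of 5 with the reference, while the pooled macrostate has a macroscopic pKa higher by log10(2), since two degenerate tautomers double the population of the protonated macrostate.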

Table 4 Microstate ∆G°i4 for SM07 derived from pKas submitted to the SAMPL6 Blind pKa Challenge shows areas of qualitative agreement with significant differences in calculated energies

Comparison of the calculated results with experiment

A single experimental pKa value of 6.08 is available for SM07 [5, 17]. SM07 was one of the few molecules whose titration was followed by NMR, showing the transition is between microstates 4 and 6. The NHLBI QM submissions (ko8yx, w4z0e, wcvnu, arcko, wexjs) [19] and the Fraczkiewicz submission (hdiyq) as well as the single KirilLanevskij submission (v8qph) predict both the correct low energy microstates and the pKa correctly (with a maximum error of 1 pH unit).

The ∆G° analysis gives access to the pH independent ∆∆G between tautomers, which can be compared with the experimental evidence for the transition between the 1H and 2H microstates. All calculations put microstate 4 as the lowest energy 1H state, in agreement with experiment. However, the free energy of microstate 7 is often very close to that of 6, so it is often predicted to be a minority species in the titration (Figs. 3c, 4c). It should be noted that in several cases the microscopic pKa between microstates 7 and 4 is close to the experimental value. However, macroscopic titration will always predominantly involve microstate 6 as it is at lower energy.

The thermodynamic consistency of the submitted predictions for SM07

Viewing the molecule as a network of connected protonation and tautomeric states allows the self-consistency of the relative energies to be determined. If the sum of the ∆G° around a closed path deviates from zero by more than the likely error of the individual values, then this group of microstate energies is not thermodynamically consistent and something is wrong. We can see that different submissions have different degrees of internal consistency. The ∆G° were summed along all cycles of length 4 in the graph of SM07 microscopic equilibria for each of the submissions. Table 4 gives the largest value of the summed ∆G° for each prediction set (∆Gcycle).
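A cycle closure test of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for Table 4: the state labels A to D, the step free energies and the tolerance are all hypothetical, and every step is taken to be evaluated at the same pH and expressed in ∆pH units.

```python
# Hypothetical free energy differences (∆pH units) for directed steps i -> j.
# Labels A-D are placeholders, not SM07 microstate numbers.
STEP_DG = {
    ("A", "B"): -6.1,   # A gains a proton to give B
    ("B", "C"): 1.2,    # tautomerization
    ("C", "D"): 5.4,    # C loses a proton to give D
    ("D", "A"): -0.5,   # tautomerization back to A
}


def step(dg, i, j):
    """∆G for the step i -> j, using the antisymmetry dG(j -> i) = -dG(i -> j)."""
    if (i, j) in dg:
        return dg[(i, j)]
    return -dg[(j, i)]


def cycle_sum(dg, cycle):
    """Sum of ∆G around a closed path; the last state connects back to the first."""
    return sum(step(dg, a, b) for a, b in zip(cycle, cycle[1:] + cycle[:1]))


def is_consistent(dg, cycle, tol=0.05):
    """True when the cycle closes to within a chosen tolerance (an assumed value)."""
    return abs(cycle_sum(dg, cycle)) <= tol


if __name__ == "__main__":
    print(is_consistent(STEP_DG, ["A", "B", "C", "D"]))
```

Applying `cycle_sum` to every 4-state cycle in a network and keeping the largest absolute value gives a quantity analogous to ∆Gcycle.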

The submissions from ECRISM (kxztt, ftc8w, ktpj5, wuuvc, 2umai, cm2yq, z7fhp, 8toyp, epvmk, xnoe0, 4o0ia, nxaaw, 0xi4b, cywyk) [23] have closed free energy cycles, apart from rounding errors in the second decimal, as does the Fraczkiewicz submission (hdiyq). The NHLBI QM submissions (ko8yx, w4z0e, wcvnu, arcko, wexjs) [19] have closed cycles for some but not every 4-microstate cycle, but the mismatches are all below 1 ∆pH unit. In contrast, the NHLBI submissions using QM-MM (0wfzo, z3btx, 758j8, hgn83) [25] do not produce thermodynamically consistent cycles, with inconsistencies of around 8 ∆pH units for the cycle containing microstates 4, 7, 15, and 11. The submission that used the Bannan OE method (6tvf8) [16] has cycles that do not sum to 0, with the largest cycle error being 8.75 ∆pH units for the cycle between microstates 4, 11, 13, and 6.

It should be noted that the ∆G°j4 connects the 1H microstate 4 to other microstates. As described in Table 2b, the energy of tautomers is derived from the sum of free energies along a thermodynamic path, and the energies of states that are separated from microstate 4 by more than 1 proton (3H and 4H microstates for SM07) are obtained by sums of ∆G°s along the path to the reference state 4. When the cycles do not close, the values in Table 4 become dependent on which path is used or on whether multiple paths are averaged.
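The path dependence that arises when cycles do not close can be illustrated with a small sketch. The labels R, P, Q and T and all step values are hypothetical: R stands for a reference state, P and Q for two tautomers one proton above it, and T for a state two protons above it. Averaging over paths is shown only as one possible choice, not as the procedure used for Table 4.

```python
# Hypothetical step free energies (∆pH units) at the reference pH for steps i -> j.
STEP_DG = {
    ("R", "P"): -6.0,
    ("R", "Q"): -5.0,
    ("P", "T"): -3.0,
    ("Q", "T"): -3.2,
}


def path_dg(dg, path):
    """Sum the step ∆G values along a path that starts at the reference state."""
    return sum(dg[(a, b)] for a, b in zip(path, path[1:]))


def state_dg(dg, paths):
    """Per-path ∆G of the final state, the spread between paths, and the mean."""
    values = [path_dg(dg, p) for p in paths]
    return values, max(values) - min(values), sum(values) / len(values)


if __name__ == "__main__":
    values, spread, mean = state_dg(STEP_DG, [["R", "P", "T"], ["R", "Q", "T"]])
    print(values, round(spread, 2), round(mean, 2))
```

Here the two routes to T differ by 0.8 ∆pH units, which is the closure error of the cycle through R, P, T and Q, so the tabulated energy of T depends on whether one path or the mean of both is reported.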

Overview of the SM07 landscapes shows qualitative consistency, but large differences in values.

Only one experimental value is available for SM07. Under these circumstances simulations can offer information if the calculations can be vetted in some manner. One check is the ability to match the single known pKa, identifying the correct macrostates (1H to 2H here) and correct microstates (4 and 6). Another check is that the overall network is thermodynamically consistent. Lastly, we might say that the calculation of pKas for molecules such as SM07 is ‘solved’ if the various submissions find similar answers for the relative state energies that can be checked against experiment for the few microstates that are experimentally accessible.

Table 4 allows evaluation of the consistency of the lowest energy microstate at the reference pH. Here this is pH 0, but the table can be remade at any pH (Eq. 3). It is apparent that at pH 0 the calculations do not agree on which is the lowest energy protonation state. It can be the 1H state (NHLBI-6 to 9), the 2H state or a 3H state. Thus, the calculations do not agree on the net charge of the SM07 molecule at the reference pH.

Another comparison amongst the calculations is to compare the ∆G°j4 of individual microstates. The overall range of energy from the microstate with no protons (microstate 12) to that with 4 protons (16) varies enormously between the different calculations, from −7.6 ∆pH units for PCM-1 to +52 ∆pH units for NHLBI-9 at pH 0. The calculations which are not thermodynamically consistent, NHLBI-6 to 9 (on the left in Fig. 5) and PCM (on the right), are clearly different from the bulk of the calculations. The thermodynamically consistent networks still have a range of energy from the most to least protonated microstates of 20 to −7.6 ∆pH units. As one ∆pH unit of energy is sufficient to change the relative population tenfold, this represents a large variation. Thus, while the ∆G°12,16 may not be significant experimentally, the difference in this value shows that the free energy landscape for this molecule is predicted to be radically different in the different calculations. The array of different values shows how SAMPL challenges allow the strengths and weaknesses of different computational methods to be seen. Outliers have much to teach us.

Fig. 5

Overview of relative free energies for individual microstates for individual submissions to the SAMPL6 blind pKa challenge. All submissions that provided information about all 11 microstates are included, ordered by the free energy difference between the 4H microstate (16) and the 0H microstate (12). Data from Table 4; definition of microstates from Fig. 1. a Blue: microstate 16, Red: 12; b Green: microstate 2; Black: 3; Red: 4 (reference state); c Green: microstate 11; Black: 7; Red: 6; d Green: microstate 15; Black: 14; Red: 13

The relative tautomer energies are, perhaps, a simpler test of the various calculation methods, as the compounds have the same net charge, so there is likely to be a smaller difference in solvation energy and less influence from the uncertainties in the calculation of the energy of the free proton. The relative energy of each set of tautomers is independent of pH. All calculations show microstates 2 and 3 of SM07 to be close in energy and at higher energy than microstate 4 (Fig. 5b). Likewise, microstates 6 and 7 are generally close in energy, with 6 being the lower and state 11 being significantly higher in energy (Fig. 5c). For the tautomers with three protons (microstates 13, 14 and 15), microstate 14 is predicted to be the lowest energy microstate, but there is less agreement about the relative energy of these tautomers (Fig. 5d). Thus, in an overall comparison of the thermodynamically consistent networks of ∆G°s, the relative tautomer free energies are in qualitative agreement.

Conclusion

The protonation and tautomer states of extended organic molecules will significantly influence their solubility, partition coefficients and binding affinities to biologically important macromolecules. Molecules used as drugs often have multiple protonation and tautomeric states. Thus, we need to be able to organize the information about the molecular macrostates (with different charge) and microstates (defining the position of all protons) so that under any set of conditions we can determine the dominant charge of the compound and its likely tautomeric state. The question addressed here is how to best organize the information we have about these complex molecules.

The SAMPL6 Blind pKa Challenge was the first SAMPL challenge directly focusing on the ability of simulation to predict pKas of complex organic molecules [5]. Evaluating the submissions made it clear that the best way to describe the pH dependence of molecules with multiple protonation and tautomeric states was not a solved problem. It proved to be difficult to compare different calculations with each other using the complex lists of microstate pKas. The work presented here shows that reporting only the free energy at a single pH, ∆G°, and the change in the number of protons, ∆m, each with respect to one (arbitrary) microstate, is a better way to report information about protonation and tautomeric states. This procedure should be used for future SAMPL challenges, but it should also be useful as a general way to archive information about molecules with multiple protonation and tautomeric states. It should be noted that this paper shows in detail how to back calculate all the microstate ∆G°s from a list of submitted pKas. However, computer simulations will often calculate relative microstate ∆G°s, which in this challenge were then submitted as pKas.

There are a number of significant advantages to listing microstate ∆G°s for a network of states rather than a list of pairwise pKas between specific states. The list of ∆G°s is more compact. Thus, for the SM07 microstates considered here, 11 ∆G°s and ∆ms provide all the information needed to determine the free energy difference between any pair of states. In contrast, there are 24 pKas that connect only states that differ by one proton. The information provided by the ∆G°s is richer. It provides the ∆G°s between tautomers, which are never evident from lists of pKas as the free energy difference between molecules with the same number of protons is pH independent (Fig. 5b,c,d). The relative energy of microstates that differ by more than 1 proton is clearly defined (Fig. 5a). The summed free energy around closed cycles in the network of microstates can be checked for thermodynamic consistency. As in any equilibrium system, knowledge of the free energy of all states determines the population of each state in the ensemble (Eq. 4). Knowing the standard state ∆G°s, and that the free energy varies linearly with pH with a slope that reflects the change in the number of protons relative to the reference state (∆m), directly provides the relative free energy of all states at all pHs (Eq. 3).
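The bookkeeping behind this comparison can be checked with a short sketch. The proton counts below are read from the Fig. 5 caption and the text; the count of one-proton pKas assumes that every pair of microstates differing by exactly one proton is connected. The function `crossing_ph` is a hypothetical helper that assumes free energies in ∆pH units and the linear form ∆G(pH) = ∆G° + ∆m × pH as our reading of Eq. 3; the ∆G° values passed to it in any usage are illustrative and not taken from Table 4.

```python
from itertools import combinations

# Number of protons on each of the 11 SM07 microstates (from the Fig. 5 caption).
PROTONS = {12: 0, 2: 1, 3: 1, 4: 1, 6: 2, 7: 2, 11: 2, 13: 3, 14: 3, 15: 3, 16: 4}


def one_proton_pairs(protons):
    """All pairs of microstates that differ by exactly one proton."""
    return [(i, j) for i, j in combinations(sorted(protons), 2)
            if abs(protons[i] - protons[j]) == 1]


def crossing_ph(dg0_i, dm_i, dg0_j, dm_j):
    """pH at which two microstates have equal free energy.

    Returns None for tautomers (equal dm), whose energy difference is pH independent.
    """
    if dm_i == dm_j:
        return None
    return (dg0_i - dg0_j) / (dm_j - dm_i)


if __name__ == "__main__":
    pairs = one_proton_pairs(PROTONS)
    print(len(PROTONS), "free energies versus", len(pairs), "one-proton pKas")
```

The same `crossing_ph` call works for states that differ by more than one proton, which is how a single list of ∆G° and ∆m values yields every possible microscopic pKa.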

The ensemble of all microstate ∆G°s allows the calculations derived by all methods, including empirically based methods such as machine learning and QSAR, to be compared with each other to determine where all methods agree (Fig. 5). In situations where experimental data is unavailable (and likely to remain so), convergence of values calculated in different ways lends support to the answers obtained by the simulations. The agreement can be qualitative, as it often is here, where the ordering of the lowest to highest energy tautomeric state is the same for all calculations. But the numerical free energy differences between states can vary significantly, showing that there is more work to be done before these simulation methods can reliably substitute for experimental measurements.

If this round of calculations does lead to a second prediction challenge for pKas, we would strongly suggest that only microscopic data be reported; that this should be given as the standard state ∆G°; and that only thermodynamically consistent networks of ∆G°s be submitted.