Introduction

Calculations of acidity constants, \( {K}_{\mathrm{a}} \), are not only of practical importance but also serve as benchmarks for testing solvation models used in conjunction with quantum-chemical calculations, particularly in an aqueous environment [1]. Without loss of generality, and specializing to water as a solvent, the constitutive reaction equation

$$ {\mathrm{H}\mathrm{A}}_{\mathrm{aq}}\to {\mathrm{H}}_{\mathrm{aq}}^{+}+{\mathrm{A}}_{\mathrm{aq}}^{-} $$
(1)

is characterized thermodynamically by the relation between the equilibrium constant, activities a, standard Gibbs energy of reaction \( {\Delta}_{\mathrm{r}}{G}^0 \), and standard chemical potentials \( {\mu}^0 \) via

$$ -{\beta}^{-1}\ln {K}_{\mathrm{a}}=-{\beta}^{-1}\ln \frac{a\left({\mathrm{H}}_{\mathrm{a}\mathrm{q}}^{+}\right)a\left({\mathrm{A}}_{\mathrm{a}\mathrm{q}}^{-}\right)}{a\left({\mathrm{H}\mathrm{A}}_{\mathrm{a}\mathrm{q}}\right)}={\Delta}_{\mathrm{r}}{G}^0={\mu}^0\left({\mathrm{H}}_{\mathrm{a}\mathrm{q}}^{+}\right)+{\mu}^0\left({\mathrm{A}}_{\mathrm{a}\mathrm{q}}^{-}\right)-{\mu}^0\left({\mathrm{H}\mathrm{A}}_{\mathrm{a}\mathrm{q}}\right) $$
(2)

where β is the inverse temperature. The standard chemical potentials in solution are commonly referenced to a standard state of 1 bar and a formal concentration of \( {c}^0=1 \) M at the specified temperature (hereinafter assumed to be 298.15 K) under the assumption of infinite dilution.

Quantum calculations of these quantities, for instance by employing a continuum solvation approach, usually model such an ideal solution state by construction. The standard chemical potential of a compound i in a given, fixed conformational and tautomeric state j can then be approximated by [2,3,4,5]

$$ {\mu}_j^0(i)\approx {\mu}_j^{0,\mathrm{id}}(i)+{E}_j^{0,\mathrm{sol}}(i)+{\mu}_j^{0,\mathrm{ex}}(i)+{G}_j^{0,\mathrm{RRHO}}(i)\equiv {\mu}_j^{0,\mathrm{id}}(i)+{G}_j^0(i). $$
(3)

It is given by the sum of an ideal (“id”) part, which contains the explicit reference to the standard concentration to be specified below, and an interaction component, termed \( {G}_j^0(i) \) here. The latter can be approximated by adding an electronic energy in solution, \( {E}_j^{0,\mathrm{sol}}(i) \), an excess chemical potential, \( {\mu}_j^{0,\mathrm{ex}}(i) \), that represents the Gibbs energy of solvation upon transferring a solute in the “frozen” structural and electronic solution state from the (ideal) gas phase into the solvent (assuming identical formal gas and solution phase concentrations), and potentially a “rigid rotor, harmonic oscillator” (RRHO) model of rotational and vibrational contributions to the Gibbs energy. As the standard condition of infinite dilution is implicitly assumed by constructing the Hamiltonian, the superscript “0” at the interaction terms \( {G}_j^0(i) \) can be dropped for simplicity.

For treating multistate species comprising an ensemble of distinct tautomeric and conformational states, two strategies are available. One approach is to sum over states by defining, in a more or less ad hoc manner, a canonical partition function while ignoring pressure-volume contributions [3,4,5], to end up for the protonation equilibrium with

$$ {\Delta}_{\mathrm{r}}{G}^0={\mu}^{0,\mathrm{id}}\left({\mathrm{H}}_{\mathrm{aq}}^{+}\right)+{\mu}^{0,\mathrm{id}}\left({\mathrm{A}}_{\mathrm{aq}}^{-}\right)-{\mu}^{0,\mathrm{id}}\left({\mathrm{H}\mathrm{A}}_{\mathrm{aq}}\right)+{\mu}^{\mathrm{ex}}\left({\mathrm{H}}_{\mathrm{aq}}^{+}\right)-{\beta}^{-1}\ln \frac{\sum \limits_{j=1}^M\exp \left[-\beta {G}_j\left({\mathrm{A}}_{\mathrm{aq}}^{-}\right)\right]}{\sum \limits_{k=1}^N\exp \left[-\beta {G}_k\left({\mathrm{H}\mathrm{A}}_{\mathrm{aq}}\right)\right]} $$
(4)

where the sums run over M base and N acid states. The fourth term on the right-hand side (r.h.s.) represents the Gibbs energy of hydration of the “proton” (again assuming identical gas phase and solution state concentrations) [6, 7]; the first three terms are purely ideal contributions. Taken together, these four terms are usually assumed to form an additive constant. Therefore, on the decadic pK scale we finally obtain the expression for the partition function (PF) approach,

$$ \mathrm{p}{K}_{\mathrm{a}}^{\mathrm{PF}}=\frac{\beta {\Delta}_{\mathrm{r}}{G}^0}{\ln 10}=b-\frac{m}{\ln 10}\ln \frac{\sum \limits_{j=1}^M\exp \left[-\beta {G}_j\left({\mathrm{A}}_{\mathrm{a}\mathrm{q}}^{-}\right)\right]}{\sum \limits_{k=1}^N\exp \left[-\beta {G}_k\left({\mathrm{HA}}_{\mathrm{a}\mathrm{q}}\right)\right]} $$
(5)

with

$$ b=\frac{\beta }{\ln 10}\left[{\mu}^{0,\mathrm{id}}\left({\mathrm{H}}_{\mathrm{aq}}^{+}\right)+{\mu}^{0,\mathrm{id}}\left({\mathrm{A}}_{\mathrm{aq}}^{-}\right)-{\mu}^{0,\mathrm{id}}\left({\mathrm{H}\mathrm{A}}_{\mathrm{aq}}\right)+{\mu}^{\mathrm{ex}}\left({\mathrm{H}}_{\mathrm{aq}}^{+}\right)\right], $$
(6)

where the terms assumed to be constant in total (i.e., ideal gas and proton) are contained in b and, additionally, computational flexibility is offered by introducing a parameter m, which, ideally, is 1. The parameters m and b are typically adjusted by fitting to experimental reference data, as was done by us [3, 4, 8] and others [5, 9, 10] (the latter reference also representing an early example of a PF-type treatment), to name just a few.
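As a minimal sketch of how Eq. (5) can be evaluated in practice, the following Python snippet computes the PF estimate from lists of state-specific Gibbs energies. The function names, the energy unit (kcal/mol), and the example energies and parameter values are assumptions made purely for illustration; they are not taken from the cited work.

```python
import math

R_KCAL = 1.987204259e-3      # gas constant in kcal mol^-1 K^-1
T = 298.15                   # temperature in K, as assumed in the text
BETA = 1.0 / (R_KCAL * T)    # inverse temperature in mol kcal^-1


def log_sum_boltzmann(energies, beta=BETA):
    """Return ln sum_j exp(-beta * G_j), shifted by the minimum for stability."""
    g_min = min(energies)
    shifted = sum(math.exp(-beta * (g - g_min)) for g in energies)
    return -beta * g_min + math.log(shifted)


def pka_pf(g_base, g_acid, m=1.0, b=0.0, beta=BETA):
    """Eq. (5): pKa = b - (m / ln 10) * ln(Z_base / Z_acid)."""
    ln_ratio = log_sum_boltzmann(g_base, beta) - log_sum_boltzmann(g_acid, beta)
    return b - m * ln_ratio / math.log(10)


# Hypothetical state energies in kcal/mol and hypothetical parameters m, b
g_acid = [0.0, 0.8]
g_base = [5.0, 5.4, 6.1]
print(pka_pf(g_base, g_acid, m=0.74, b=1.0))
```

For a single acid and a single base state the function reduces to a linear relation in the energy difference, which is the form used for the state-to-state constants.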

The alternative is to connect all base and acid states by individual transition equilibria as

$$ {\mathrm{H}\mathrm{A}}_{\mathrm{aq},k}\to {\mathrm{H}}_{\mathrm{aq}}^{+}+{\mathrm{A}}_{\mathrm{aq},j}^{-} $$
(7)

for which straightforward reduction of Eq. (5) would give

$$ \mathrm{p}{K}_{a, jk}=b+\frac{m\beta}{\ln 10}\left[{G}_j\left({\mathrm{A}}_{\mathrm{aq}}^{-}\right)-{G}_k\left({\mathrm{HA}}_{\mathrm{aq}}\right)\right]. $$
(8)

The individual state-to-state equilibrium constants can then be assembled to yield the macroscopic form from mass balance,

$$ {K}_{\mathrm{a}}=\frac{a\left({\mathrm{H}}_{\mathrm{a}\mathrm{q}}^{+}\right){\sum}_{j=1}^M{c}_j\left({\mathrm{A}}_{\mathrm{a}\mathrm{q}}^{-}\right)/{c}^0}{\sum_{k=1}^N{c}_k\left({\mathrm{H}\mathrm{A}}_{\mathrm{a}\mathrm{q}}\right)/{c}^0}, $$
(9)

as [11]

$$ {K}_{\mathrm{a}}^{\mathrm{ST}}={\sum}_{j=1}^M\frac{1}{\sum_{k=1}^N\frac{1}{K_{a, jk}}} $$
(10)

where “ST” denotes the state transition approach. Bochevarov et al. [11] distinguish between “micro-” (i.e., tautomeric) and “nano-” (i.e., conformational) states, which leads to another layer in the continued fraction expansion. Here, we need no such discrimination, as the distinction between tautomers and (underlying) conformers is purely semantic, though pragmatically useful for certain models [11]. Physically, tautomers and conformers of a given ionization state correspond to different local minima of the (free) energy surface derived from one and the same molecular Hamiltonian, whereas the Hamiltonians of acid and base forms differ. Hence, we simply refer to “states” between which transitions can occur in the equilibrium mixture.
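The assembly of state-to-state constants according to Eqs. (8) and (10) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the energy unit (kcal/mol), and all numerical inputs are hypothetical and not taken from the cited work.

```python
import math

BETA = 1.0 / (1.987204259e-3 * 298.15)  # inverse temperature in mol kcal^-1


def micro_pka(g_base_j, g_acid_k, m=1.0, b=0.0, beta=BETA):
    """Eq. (8): pKa of the transition from acid state k to base state j."""
    return b + m * beta * (g_base_j - g_acid_k) / math.log(10)


def ka_st(g_base, g_acid, m=1.0, b=0.0, beta=BETA):
    """Eq. (10): K_a = sum_j 1 / sum_k (1 / K_jk), with 1 / K_jk = 10**pK_jk."""
    total = 0.0
    for g_j in g_base:
        inverse_sum = sum(10.0 ** micro_pka(g_j, g_k, m, b, beta) for g_k in g_acid)
        total += 1.0 / inverse_sum
    return total


def pka_st(g_base, g_acid, m=1.0, b=0.0, beta=BETA):
    """Macroscopic pKa on the decadic scale from the ST expression."""
    return -math.log10(ka_st(g_base, g_acid, m, b, beta))


# Hypothetical state energies in kcal/mol and hypothetical parameters m, b
g_acid = [0.0, 0.8]
g_base = [5.0, 5.4, 6.1]
print(pka_st(g_base, g_acid, m=0.74, b=1.0))
```

The double loop mirrors the continued fraction directly; no distinction between tautomers and conformers is needed because every entry of the two lists is simply treated as a state.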

Although equivalence is plausible, since the information content of the PF and ST approaches in terms of the state-specific Gibbs energies is identical, it is not immediately obvious under which circumstances both methods yield identical results. The goal of the present work was therefore to elucidate the formal equivalence both analytically and numerically. As will be shown below, both methods agree only in the limiting case of a regression “slope” parameter m of exactly 1, i.e., for ideal (and usually inapplicable) models that do not require any form of empirical scaling of energies, or, trivially, in situations where only single acid and base states are considered. Numerically, this conclusion will be illustrated and discussed by re-analysis of our previous results obtained during the recent SAMPL6 (“Statistical Assessment of the Modeling of Proteins and Ligands”) challenge [4, 12] on blindly predicting aqueous pKa values for a number of kinase inhibitor fragments with multiple protonation states and considerable conformational flexibility.

Theory

Formal correspondence of the PF and ST approaches can be proved if the mass balance equation leading to the continued fraction representation (10) can be derived on the same statistical-mechanical footing as Eq. (5) and its reduction to Eq. (8). We therefore start with the fundamental expression for the chemical potential of a molecule (omitting index i for notational simplicity) composed of distinct states j such that the approximation (3) and the assumption of negligible pressure-volume work hold,

$$ \mu =-{\beta}^{-1}\ln \frac{V}{\varLambda^3{N}_{\mathrm{M}}}-{\beta}^{-1}\ln {\sum}_j\exp \left[-\beta {G}_j\right]={\mu}^{\mathrm{id}}-{\beta}^{-1}\ln Z\equiv {\mu}^{\mathrm{id}}+G. $$
(11)

Here, G is the excess (interaction) part of the total chemical potential as in Eq. (3), V represents the volume, \( {N}_{\mathrm{M}} \) is the number of solute molecules, which is 1 at infinite dilution, Z is the partition function, and Λ denotes the thermal wavelength given by

$$ \varLambda ={\left(\frac{\beta {h}^2}{2\pi {m}_{\mathrm{M}}}\right)}^{1/2} $$
(12)

with molecular mass \( {m}_{\mathrm{M}} \) and Planck’s constant h. The statistical-mechanical chemical potential should be equivalent to the macroscopic thermodynamic definition

$$ {\displaystyle \begin{array}{c}\mu ={\mu}^0+{\beta}^{-1}\ln a={\mu}^0+{\beta}^{-1}\ln \left(\gamma c/{c}^0\right)\\ {}={\mu}^0+{\beta}^{-1}\ln {\sum}_j{\gamma}_j{c}_j/{c}^0\underset{c\to 0}{=}{\mu}^0+{\beta}^{-1}\ln {\sum}_j{c}_j/{c}^0\end{array}} $$
(13)

with total solute concentration c, split into state contributions \( {c}_j \) according to mass balance, and activity coefficients \( {\gamma}_j \) that approach 1 at infinite dilution. By noting that the probability of a state j can be written as

$$ {p}_j=\frac{\exp \left[-\beta {G}_j\right]}{Z}=\frac{c_j}{c}=\frac{c_j}{c}\frac{c^0}{c^0} $$
(14)

where we inserted \( 1={c}^0/{c}^0 \), and inserting \( 1={\sum}_j{p}_j \) as denominator in G of Eq. (11) we have

$$ {\displaystyle \begin{array}{c}G=-{\beta}^{-1}\ln \frac{\sum_j\exp \left[-\beta {G}_j\right]}{\sum_j{p}_j}=+{\beta}^{-1}\ln {\sum}_j\frac{p_j}{Z}=+{\beta}^{-1}\ln \left(\frac{c^0}{Zc}{\sum}_j\frac{c_j}{c^0}\right)\\ {}={\beta}^{-1}\ln \frac{c^0}{Zc}+{\beta}^{-1}\ln {\sum}_j\frac{c_j}{c^0}.\end{array}} $$
(15)

We recover the concentration dependence of (13) as the last term on the r.h.s., and the standard chemical potential therefore becomes

$$ {\mu}^0={\mu}^{\mathrm{id}}+{\beta}^{-1}\ln \frac{c^0}{Zc}=-{\beta}^{-1}\ln \frac{ZVc}{c^0{\varLambda}^3} $$
(16)

which, by also noting that c → 1/V at infinite dilution, finally yields

$$ {\mu}^0=-{\beta}^{-1}\ln \frac{Z}{c^0{\varLambda}^3}=-{\beta}^{-1}\ln \frac{\sum_j\exp \left[-\beta {G}_j\right]}{c^0{\varLambda}^3}. $$
(17)

For the protonation equilibrium in the partition function derivation we then ultimately find from inserting the expressions for the standard chemical potential for the reacting species into Eq. (2) and taking the negative decadic logarithm

$$ \mathrm{p}{K}_a=-\frac{1}{\ln 10}\ln \frac{\varLambda^3\left(\mathrm{HA}\right){\left({c}^0\right)}^{-1}}{\varLambda^3\left({\mathrm{A}}^{-}\right){\varLambda}^3\left({\mathrm{H}}^{+}\right)}+\frac{\beta }{\ln 10}{\mu}^{\mathrm{ex}}\left({\mathrm{H}}_{\mathrm{aq}}^{+}\right)-\frac{1}{\ln 10}\ln \frac{\sum_{j=1}^M\exp \left[-\beta {G}_j\left({\mathrm{A}}_{\mathrm{aq}}^{-}\right)\right]}{\sum_{k=1}^N\exp \left[-\beta {G}_k\left({\mathrm{H}\mathrm{A}}_{\mathrm{aq}}\right)\right]}. $$
(18)

Comparison with Eq. (5) shows that both relations are equivalent (for m = 1), which demonstrates that mass balance leads directly to the partition function approach. It is, however, important to note that the regression intercept b, i.e., the sum of the first two terms in the latter equation, is actually not a constant: it depends not only on the mass of the proton but also on the mass ratio of acid and base forms via the thermal wavelengths, though not on the particular state. Unless the mass-dependent terms are grouped with the Boltzmann factors, the intercept can be interpreted as essentially constant only in the limit of a much larger molecular mass of the compound compared to the proton, which, however, holds true in most situations. In this limit, the first term becomes −4.39 kcal mol⁻¹ (see also [13]) compared to the much larger Tissandier value for the Gibbs solvation energy of the proton of −265.89 kcal mol⁻¹ [6] (assuming identical gas and solution phase concentrations). For, e.g., HF, the first quantity would change by 7.5%, corresponding to ca. 0.24 pK units. Very accurate calculations should, therefore, take this effect into account. Note that this result holds not only within the quantum-statistical formalism invoked for the chemical potential, but also in a classical framework, where integration over momentum space yields the identical dependence of the standard chemical potential on molecular mass and standard concentration (see Eq. (8) in [14]).
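The magnitude of the first term of Eq. (18) in the large-molecule limit, where the thermal wavelengths of acid and base cancel, can be checked with a short sketch based on Eq. (12). The physical constants and the proton mass are standard reference values supplied here as assumptions, and the function names are hypothetical.

```python
import math

H_PLANCK = 6.62607015e-34   # Planck constant in J s
K_B = 1.380649e-23          # Boltzmann constant in J K^-1
N_A = 6.02214076e23         # Avogadro constant in mol^-1
AMU = 1.66053906660e-27     # atomic mass unit in kg
J_PER_KCAL = 4184.0         # joules per kilocalorie
T = 298.15                  # temperature in K, as assumed in the text


def thermal_wavelength(mass_amu, temperature=T):
    """Eq. (12): Lambda = sqrt(beta * h^2 / (2 * pi * m)), returned in metres."""
    beta = 1.0 / (K_B * temperature)
    return math.sqrt(beta * H_PLANCK ** 2 / (2.0 * math.pi * mass_amu * AMU))


def ideal_proton_term_kcal(mass_proton_amu=1.007276, c0_molar=1.0, temperature=T):
    """First term of Eq. (18) on the energy scale for Lambda(HA) = Lambda(A-).

    In this limit the term reduces to beta^-1 * ln(c0 * Lambda_H^3).
    """
    c0 = c0_molar * 1000.0 * N_A                      # molecules per m^3
    lam = thermal_wavelength(mass_proton_amu, temperature)
    rt_kcal = K_B * temperature * N_A / J_PER_KCAL    # beta^-1 in kcal mol^-1
    return rt_kcal * math.log(c0 * lam ** 3)


print(ideal_proton_term_kcal())
```

Under these assumptions the function returns a value of about −4.4 kcal mol⁻¹, in line with the figure quoted in the text; small differences in the last digit can arise from rounding of the constants.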

To close the proof of equivalence, we note that mass balance according to Eq. (9) for the protonation equilibrium also leads readily to the continued fraction expansion (10), as derived in [11], into which Eq. (8) can be inserted to show under which conditions Eqs. (5) and (18) arise. Rewriting Eq. (8) on the energy scale and inserting into (10) yields

$$ {K}_{\mathrm{a}}^{\mathrm{ST}}={\sum}_{j=1}^M\frac{1}{\sum_{k=1}^N\frac{1}{K_{a, jk}}}={\sum}_{j=1}^M\frac{1}{\sum_{k=1}^N{\left(\frac{\exp \left[-\beta {G}_k\left({\mathrm{H}\mathrm{A}}_{\mathrm{a}\mathrm{q}}\right)\right]}{\exp \left[-\beta {G}_j\left({\mathrm{A}}_{\mathrm{a}\mathrm{q}}^{-}\right)\right]}\right)}^m{10}^b}. $$
(19)

In the innermost sum, the j-dependent denominator is constant for all k such that we obtain

$$ {\displaystyle \begin{array}{c}{K}_{\mathrm{a}}^{\mathrm{ST}}={\sum}_{j=1}^M\frac{1}{\frac{\sum_{k=1}^N\exp \left[-m\beta {G}_k\left({\mathrm{H}\mathrm{A}}_{\mathrm{a}\mathrm{q}}\right)\right]}{\exp \left[-m\beta {G}_j\left({\mathrm{A}}_{\mathrm{a}\mathrm{q}}^{-}\right)\right]}{10}^b}={\sum}_{j=1}^M\frac{\exp \left[-m\beta {G}_j\left({\mathrm{A}}_{\mathrm{a}\mathrm{q}}^{-}\right)\right]\;{10}^{-b}}{\sum_{k=1}^N\exp \left[-m\beta {G}_k\left({\mathrm{H}\mathrm{A}}_{\mathrm{a}\mathrm{q}}\right)\right]}\\ {}={10}^{-b}\frac{\sum_{j=1}^M\exp \left[-m\beta {G}_j\left({\mathrm{A}}_{\mathrm{a}\mathrm{q}}^{-}\right)\right]}{\sum_{k=1}^N\exp \left[-m\beta {G}_k\left({\mathrm{H}\mathrm{A}}_{\mathrm{a}\mathrm{q}}\right)\right]}\end{array}} $$
(20)

in the ST form. In contrast, the corresponding PF result derived from Eq. (5) reads

$$ {K}_{\mathrm{a}}^{\mathrm{PF}}={10}^{-b}\;{\left(\frac{\sum_{j=1}^M\exp \left[-\beta {G}_j\left({\mathrm{A}}_{\mathrm{a}\mathrm{q}}^{-}\right)\right]}{\sum_{k=1}^N\exp \left[-\beta {G}_k\left({\mathrm{HA}}_{\mathrm{a}\mathrm{q}}\right)\right]}\right)}^m $$
(21)

which clearly shows that the two expressions are identical only if m = 1 or, for arbitrary m, if only one state exists per acid and base form; the intercept b is unaffected in either case.
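The condition for agreement between Eqs. (20) and (21) can be illustrated numerically. The sketch below evaluates both closed forms for a hypothetical set of state energies (in kcal/mol) and hypothetical parameters; none of the numbers are taken from the data analyzed in this work.

```python
import math

BETA = 1.0 / (1.987204259e-3 * 298.15)  # inverse temperature in mol kcal^-1


def boltzmann_sum(energies, power=1.0, beta=BETA):
    """Return sum_j exp(-power * beta * G_j)."""
    return sum(math.exp(-power * beta * g) for g in energies)


def pka_pf(g_base, g_acid, m, b):
    """Eq. (21): the exponent m is applied to the ratio of partition functions."""
    return b - m * math.log10(boltzmann_sum(g_base) / boltzmann_sum(g_acid))


def pka_st(g_base, g_acid, m, b):
    """Eq. (20): the exponent m is applied to each Boltzmann factor separately."""
    return b - math.log10(boltzmann_sum(g_base, m) / boltzmann_sum(g_acid, m))


# Hypothetical multistate example
g_acid = [0.0, 0.8]
g_base = [5.0, 5.4, 6.1]
for m in (1.0, 0.74, 0.5):
    diff = pka_pf(g_base, g_acid, m, 1.0) - pka_st(g_base, g_acid, m, 1.0)
    print(f"m = {m:.2f}: pKa(PF) - pKa(ST) = {diff:+.4f}")

# Single-state example: both forms coincide for any m
print(pka_pf([5.0], [0.0], 0.74, 1.0) - pka_st([5.0], [0.0], 0.74, 1.0))
```

The intercept cancels in the difference, so any discrepancy between the two models stems solely from where the exponent m is applied.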

Numerical illustration

To demonstrate the effect of slope parameters m ≠ 1 on the relative performance of both the PF and ST models for a realistic prediction problem, here we re-analyze training and test set data obtained during the SAMPL6 challenge [4], where the PF model was employed exclusively. Briefly, we tested EC-RISM [2] theory for treating aqueous solvation in conjunction with quantum-chemical calculations, and were able to show that root mean square errors (RMSE) of ca. 1.0 pK units could be achieved for a well-known training set [15]. About the same error (1.1 pK units) was obtained for the independent test set composed of kinase inhibitor fragments, whose microstates were provided as part of the SAMPL6 challenge where the task was to blindly predict their pKa values. The challenge was explicitly designed to cover molecules with multiple protonation sites, ionization states, and high conformational freedom, which necessitated adequate conformational sampling based on a large reference set of tautomers provided by the organizers. It is therefore not immediately clear that the PF and the ST approaches should perform similarly, as a non-unity slope together with large state ensembles suggests discrepancies (see derivation and discussion above).

We confine our re-analysis to the best-performing quantum-chemical level of theory and solvation model, termed “MP2/6-311+G(d,p)/φopt/copt2” in [4], where we used the two best-ranked conformations per tautomeric state (“copt2”) with an optimized model to compute electrostatic solute-solvent interactions (“φopt”) in combination with the 6-311+G(d,p) basis set within MP2 calculations. As a consistency check, besides the “2par” regression models (m and b variable), we additionally tested the PF and the ST models with a fixed slope of m = 1 (“1par”), not only to demonstrate the resulting equivalence, but also to analyze the impact on predictive performance. All statistical regression and metrics data are found in Tables 1 (training set) and 2 (test set), while the individual correlations of calculated and experimental data for the various methods are depicted in Fig. 1. Note that, unlike the linear regression problem of the PF approach, the ST model requires nonlinear optimization of a loss function defined by the sum of squared residuals.
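The difference between the two regression problems can be sketched in a few lines. The PF model is linear in m and b, whereas in the ST model m enters the Boltzmann factors, so the sum of squared residuals has to be minimized numerically. The sketch below uses a golden-section search over m with b eliminated analytically; this particular optimizer, the assumption of a single minimum in the search interval, and the synthetic data are illustrative choices made here, not the procedure used for the results reported in this work.

```python
import math

BETA = 1.0 / (1.987204259e-3 * 298.15)  # inverse temperature in mol kcal^-1


def log10_ratio(g_base, g_acid, power=1.0):
    """log10 of sum_j exp(-power*beta*G_j) / sum_k exp(-power*beta*G_k)."""
    z_base = sum(math.exp(-power * BETA * g) for g in g_base)
    z_acid = sum(math.exp(-power * BETA * g) for g in g_acid)
    return math.log10(z_base / z_acid)


def fit_pf(data):
    """Linear least squares for pKa = b + m * x with x = -log10(Z_base/Z_acid)."""
    xs = [-log10_ratio(gb, ga) for gb, ga, _ in data]
    ys = [pk for _, _, pk in data]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_mean - m * x_mean


def st_loss(m, data):
    """Sum of squared residuals of the ST model at slope m, with optimal b."""
    raw = [-log10_ratio(gb, ga, power=m) for gb, ga, _ in data]
    b = sum(pk - r for (_, _, pk), r in zip(data, raw)) / len(data)
    ssr = sum((pk - (b + r)) ** 2 for (_, _, pk), r in zip(data, raw))
    return ssr, b


def fit_st(data, lo=0.1, hi=2.0, tol=1e-8):
    """Golden-section search over m (assumes a single minimum in [lo, hi])."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if st_loss(c, data)[0] < st_loss(d, data)[0]:
            hi = d
        else:
            lo = c
    m = 0.5 * (lo + hi)
    return m, st_loss(m, data)[1]


# Synthetic data: hypothetical state energies (kcal/mol) with reference pKa
# values generated from the ST model itself using m = 0.8 and b = 2.0
states = [([5.0, 5.4], [0.0]), ([8.0], [0.0, 0.3]),
          ([3.0, 3.1, 4.0], [0.0, 1.0]), ([10.0, 10.2], [0.0, 0.1, 0.5])]
data = [(gb, ga, 2.0 - log10_ratio(gb, ga, power=0.8)) for gb, ga in states]

print("PF fit (m, b):", fit_pf(data))
print("ST fit (m, b):", fit_st(data))
```

Because the synthetic reference values were generated with the ST model, the ST fit should recover the generating parameters, while the PF fit returns slightly different values for the same data.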

Table 1 Parameters of optimized embedded cluster reference interaction site model (EC-RISM)-based aqueous pKa models for the training set along with statistical metrics [root mean square error (RMSE), mean absolute error (MAE), mean signed error (MSE), slope m, intercept b, and coefficient of determination R² from predictive regression]. “PF/2par” represents metrics reported in [4]

Table 2 Statistical metrics for pKa predictions on the test set (RMSE, MAE, MSE, slope m′, intercept b′, and coefficient of determination R² from descriptive regression) for various models. “PF/2par” represents metrics reported in [4]
Fig. 1a–d

Embedded cluster reference interaction site model (EC-RISM)-derived vs. experimental pKa for training [15] (top row) and test set [4] (bottom row) calculated with either the partition function (PF; blue) or the state transition (ST) approach (red, mostly overlaying blue symbols). The charge state of the molecular species involved in the transition is reflected by different symbols: squares, acids (training set)/anionic transition (test set); triangles, bases (training set)/cationic transition (test set). Anionic and cationic refer to the sum over the charges of all species involved in a reaction, which is negative or positive, respectively. Pairs of calculated and experimental data points are collected in Online Resource 1; raw data underlying the regression analysis were taken from the Online Resources of [4]. a, c: using m as a free parameter; b, d: using a fixed m = 1.

Training both models in 1par and 2par variants showed the expected results. The trained parameters and the resulting statistical metrics are identical in the 1par case, as mathematically required, whereas the 2par models differ by a small, almost negligible amount, mainly because tautomeric and conformational freedom is limited in the training data set (see Online Resources in [4]). The same holds true when applying the trained models to the test data set from the SAMPL6 challenge: despite the drastically differing diversity of the two sets, the results are in line with those from the training set. One has to keep in mind, though, that the 1par models are substantially inferior regarding performance (training set), and even more so in terms of predictivity (test set), which emphasizes the importance of scaling (free) energies by the slope parameter. Because the differences between PF and ST models are so small, which is perhaps surprising, it is in practice almost irrelevant which approach is preferred for acidity predictions.

Concluding remarks

In summary, we addressed a conceptual problem for practical pKa calculations that results from the necessity to include in the prediction model an energy scaling parameter m that is typically adjusted empirically. To this end, we derived the rigorous statistical-mechanical expressions for the acidity constants for two variants of multistate calculations, which revealed the source of an inconsistency when used within regression analysis.

From a mathematical perspective, the issue boils down to the inequality \( {\left(x+y\right)}^m\ne {x}^m+{y}^m \) for arbitrary m ≠ 1, where x and y represent non-zero Boltzmann factors of different tautomeric or conformational states of a given molecule. This finding is an example of a case in which formal equivalence of two approaches does not necessarily translate into equivalence in practical applications where numerical model adjustments turn out to be necessary. One might have expected significant differences between the partition function (l.h.s.) and the state transition (r.h.s.) approaches, as the regression results indicate that m deviates significantly from 1. However, for the training set the largest difference between PF and ST results is on the order of 0.1 pK units with our slope parameter of 0.74. Even with a smaller m of 0.5, this difference would not exceed approximately 0.2 pK units. This means that both models are quantitatively very similar for practical purposes, at least as long as a sufficiently accurate methodology is applied as in this work, and there is no obvious reason to prefer one method over the other.
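The order of magnitude of these differences can be made plausible with a simple limiting case, introduced here purely for illustration: assume a single acid state and n degenerate base states sharing the same Gibbs energy. Inserting these assumptions into Eqs. (20) and (21), the energy-dependent terms and the intercept cancel in the difference, leaving

$$ \mathrm{p}{K}_{\mathrm{a}}^{\mathrm{PF}}-\mathrm{p}{K}_{\mathrm{a}}^{\mathrm{ST}}=-\frac{m}{\ln 10}\ln n+\frac{1}{\ln 10}\ln n=\left(1-m\right){\log}_{10}n. $$

For n = 2 this hypothetical case gives about 0.08 pK units at m = 0.74 and about 0.15 pK units at m = 0.5, which is comparable in size to the deviations quoted above. Non-degenerate states reduce the effective number of populated states and hence the difference.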

Another result of the rigorous derivation was that the regression constant is actually variable, though with limited, and, in practice, mostly negligible range, as it depends, strictly speaking, on the mass ratio between acid and base form, which approaches unity only in the limit of large molecules. Taken together, these findings could be useful to the community as they clarify potential sources of controversy.