Introduction

The use of in silico modelling in rational design has become a popular and valuable tool in current research and development for agricultural, environmental, and pharmaceutical applications, as a multifaceted technique capable of providing rapid understanding of in situ phenomena that may be difficult to measure or study [1]. Computer-aided modeling is advantageous for forecasting how a molecule may react in different environments and is heavily utilized for virtual screening and lead optimization in drug discovery as a provisional method for physicochemical and biophysical characterization, including solubility, ionization, and lipophilicity. While there are many computational high-throughput models for predicting physicochemical properties, challenges persist in predicting how molecules ionize in solution. The acid dissociation constant (Ka), or its corresponding logarithmic constant (pKa), is a quantitative measure of the strength of an acid in solution. In the context of acid-base reactions, it is related to the free energy \((\Delta {G_{{\text{aq}}}})\) of an acid losing a proton:

$${\text{p}}{K_a}=\frac{{\Delta {G_{{\text{aq}}}}}}{{RT\ln 10}}$$
(1)
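As a minimal illustration of Eq. 1, the conversion between a solution phase free energy and a pKa value can be sketched in Python. The function names and the choice of kcal/mol units at 298.15 K are assumptions made for this sketch and are not part of the original workflow.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def pka_from_dg(dg_aq, temperature=298.15):
    """Eq. 1: pKa = dG_aq / (RT ln 10), with dG_aq in kcal/mol."""
    return dg_aq / (R_KCAL * temperature * math.log(10))


def dg_from_pka(pka, temperature=298.15):
    """Inverse of Eq. 1: free energy in kcal/mol for a given pKa."""
    return pka * R_KCAL * temperature * math.log(10)


# One pKa unit corresponds to roughly 1.36 kcal/mol at 298.15 K.
print(round(dg_from_pka(1.0), 2))
```

This conversion factor is useful for judging errors: a free energy error of about 1.4 kcal/mol already shifts the predicted pKa by one unit.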

Many methods for predicting pKa have been designed, spanning electronic structure theory, molecular mechanics, and machine learning approaches [2,3,4]. Popular QSAR-style methods have been implemented in software packages, such as ADMET Predictor (S+pKa method [5]), Epik [6], pKa Prospector [7], and ACD/pKa Percepta Platform [8]. While these empirical methods can provide instantaneous predictions, inaccuracies arise for large and flexible molecules in which steric effects and microstate conformations exceed what the Hammett–Taft approach can describe [9].

A variety of semi-empirical and quantum chemical approaches have been developed—varying by not only the level of theory, but also by the solvation model and the reaction scheme [10,11,12,13]. For semi-empirical approaches, Jensen et al. considered several combinations of semi-empirical methods and implicit solvation models to predict the pKa of 48 druglike molecules using a relative pKa calculation scheme [14]. From the evaluation of six semi-empirical methods, the AM1 and PM3 methods provided predictions within 1.4–1.6 pH units. Another study comparing semi-empirical approaches with ab initio methods for predicting pKa values on a set of molecules containing a variety of ionizable groups, including alcohols and carboxylic acids, showed that PM6-based methods can provide predictions close to the accuracy of CBS-4B3/SMD [15].

Various ab initio methods, including CBS [16], Gaussian-n [17,18,19,20] and ccCA [21], have been applied with continuum solvation models to predict pKa values and are reported to agree with experiment to within as little as 0.5 pKa units; however, most of these approaches have only been employed on small molecule datasets [14, 15, 22,23,24,25,26]. Although wavefunction-based methods and composite ab initio methods provide high levels of accuracy, they are less attractive for larger molecules due to the computational expense, hence the interest in exploiting more approximate methods, such as electronic density-based approaches.

Density functional theory (DFT) methods are popular as they have been applied to an array of chemical applications, achieving desired accuracies for a broad range of gas phase reactions and properties [27, 28]. There are many DFT functionals, and extensive assessments illustrate that different functionals perform better for specific properties [29]. For calculations in the solution phase, DFT functionals are often used with implicit continuum models, such as the CPCM [30], COSMO [31], and SMD [32] models, which are optimized for use with modest levels of theory (smaller basis sets). Several studies employing hybrid functionals (including B3LYP, B97-1, BMK, B98, M06, and M06-2X) with the SMD model have shown that the M06-2X functional provides more accurate predictions than the other functionals considered for main group element calculations, which would be expected as the SMD model was parametrized using M05-2X [26, 32]. The combination of the M06-2X density functional and the SMD model has been used in a recent pKa study that examined the effects of tuning the solvent-accessible surface describing the solute-solvent boundary. That study reported that mean unsigned errors of 0.9, 0.4, and 0.5 pKa units for carboxylic acids, aliphatic amines, and thiols, respectively, could be obtained by scaling the solute radii; however, this approach only had a significant impact on thiols, as the default radii yielded mean unsigned errors of 1.3, 1.0, and 4.9 pKa units for carboxylic acids, aliphatic amines, and thiols, respectively [33]. Because different groups evaluate their methods on different datasets, it is difficult to compare the various approaches. SAMPL blind challenges provide a unique platform for designing novel approaches and assessing current methods.
The need for appropriate methods for the prediction of pKa was highlighted in the previous SAMPL5 challenge for predicting partition coefficients, as the ionization and tautomerization states differed in the cyclohexane and water phase [34, 35]. The SAMPL6 pKa challenge entails the prediction of microscopic and macroscopic pKa values divided into three sub-challenges: (1) the prediction of microscopic pKa values of associated microstates; (2) the prediction of microstate population as a function of pH ranging from 2 to 12; and (3) the prediction of the macroscopic pKa. The dataset is composed of 24 drug-like fragments, each containing multiple ionization and tautomeric states (Fig. 1).

Fig. 1 Structures of the 24 molecules in the SAMPL6 pKa challenge

In this work for the SAMPL6 challenge, we explored several unique approaches to predict microscopic and macroscopic pKa values. Absolute pKa values were predicted using three different calculation schemes: the direct scheme, the vertical scheme, and the adiabatic scheme. We consider multiple tactics in efforts to achieve more accurate predictions. For each scheme, we tried to improve the accuracy by (1) single point energy corrections utilizing larger basis sets; (2) including multiple conformations per microstate in the pKa calculation; (3) including explicit water molecules to stabilize neutral and charged microstates; and (4) applying a linear correction to the calculated pKa values.

Methods

A source of error in pKa calculations arises from the reaction scheme used to approximate the solution phase free energy (ΔGaq). For a generic acid (HA) in water, the equilibrium of acid dissociation reaction (Ka) can be written symbolically as:

$${\text{HA}}+{{\text{H}}_2}{\text{O}} \rightleftharpoons {{\text{A}}^ - }+{{\text{H}}_3}{{\text{O}}^+};{\text{ }}{K_a}=\frac{{\left[ {{{\text{A}}^ - }} \right]\left[ {{{\text{H}}_3}{{\text{O}}^+}} \right]}}{{\left[ {{\text{HA}}} \right]\left[ {{{\text{H}}_2}{\text{O}}} \right]}}$$
(2)

which expresses the proton transfer from the acid to yield its conjugate base (A−) and hydronium (H3O+). For this expression, the direct thermodynamic cycle (Fig. 2) is used for calculating absolute pKa values. In concentrated aqueous solutions, the expression can be simplified to the dissociation of an acid into its conjugate base (Cycle B). Previous studies comparing thermodynamic cycles with continuum solvation models highlight that the simplified expression, Cycle B, tends to be more accurate than Cycle A [24]. In Cycle A, the solution phase free energy is computed using the gas phase \((\Delta {G_{{\text{gas}}}})\) and solvation free energies \((\Delta {G_{\text{S}}})\). The solvation free energy of the proton used, \(\Delta G_{{\text{S}}}^{*}({{\text{H}}^+}),\) is −265.9 kcal/mol [36], which includes the standard state correction from 1 atm to 1 M. The proton gas phase free energy \((G_{{{\text{gas}}}}^{ \circ }({{\text{H}}^+})= - 6.28{\text{ kcal/mol}})\) comes from the Sackur–Tetrode equation [37].

Fig. 2 Thermodynamic cycles used for pKa calculation schemes

$$\Delta {G_{{\text{aq}}}}=\Delta {G_{{\text{gas}}}}+\Delta \Delta {G_{\text{S}}}$$
(3a)
$$\Delta G_{{{\text{gas}}}}^{*}=G_{{{\text{gas}}}}^{ \circ }({{\text{H}}^+})+G_{{{\text{gas}}}}^{ \circ }({{\text{A}}^ - }) - G_{{{\text{gas}}}}^{ \circ }({\text{HA}})+RT{\text{ }}\ln \left( {\frac{{RT}}{P}} \right)$$
(3b)
$$\Delta G_{{\text{S}}}^{*}=\Delta G_{{\text{S}}}^{ * }({{\text{H}}^+})+\Delta G_{{\text{S}}}^{ * }({{\text{A}}^ - }) - \Delta G_{{\text{S}}}^{ * }({\text{HA}})$$
(3c)

Here, we use the superscript “°” to denote the condition of 1 atm and “*” to denote the condition of 1 M.
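Equations 3a to 3c can be combined into a short numerical sketch. The proton values are those quoted in the text; the function names and the example free energies for HA and A− are hypothetical and serve only to show how the terms are assembled.

```python
import math

R_KCAL = 1.987204e-3     # gas constant in kcal/(mol K)
R_L_ATM = 0.0820574      # gas constant in L atm/(mol K)
T = 298.15               # temperature in K
G_GAS_PROTON = -6.28     # kcal/mol, gas phase free energy of H+ (from the text)
DG_SOLV_PROTON = -265.9  # kcal/mol, solvation free energy of H+ (from the text)


def standard_state_correction(temperature=T, pressure_atm=1.0):
    """RT ln(RT/P): converts the gas phase standard state from 1 atm to 1 M."""
    return R_KCAL * temperature * math.log(R_L_ATM * temperature / pressure_atm)


def dg_aq_cycle(g_gas_ha, g_gas_a, dg_solv_ha, dg_solv_a):
    """Solution phase free energy of HA -> A- + H+ (all inputs in kcal/mol)."""
    dg_gas = (G_GAS_PROTON + g_gas_a - g_gas_ha
              + standard_state_correction())            # Eq. 3b
    ddg_solv = DG_SOLV_PROTON + dg_solv_a - dg_solv_ha  # Eq. 3c
    return dg_gas + ddg_solv                            # Eq. 3a


# Hypothetical inputs: gas phase free energies relative to HA, and
# solvation free energies of HA and A-.
dg_aq = dg_aq_cycle(g_gas_ha=0.0, g_gas_a=345.0, dg_solv_ha=-7.0, dg_solv_a=-75.0)
pka = dg_aq / (R_KCAL * T * math.log(10))
print(round(dg_aq, 2), round(pka, 2))
```

At 298.15 K the term RT ln(RT/P) evaluates to about 1.89 kcal/mol, which is equivalent to about 1.39 pKa units.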

Calculation schemes

In this challenge, three different schemes are used to compute the free energy for each microstate pair. The notations \({\mathbf{R}_\mathbf{g}}\) and \({\mathbf{R}_\mathbf{l}}\) correspond to stationary points obtained from gas phase and solution phase optimizations, respectively [38].

Scheme D: direct scheme

The direct scheme (denoted Scheme D) determines the solution phase free energy without the use of a thermodynamic cycle.

$${G^{\text{D}}}={E_{{\text{aq}}}}\left( {{\mathbf{R}_\mathbf{l}}} \right)+G_{{{\text{aq}}}}^{{{\text{corr}}}}\left( {{\mathbf{R}_\mathbf{l}}} \right)$$
(4)

In this scheme, the reaction free energy is determined from solution phase geometries. Thermal corrections to the free energy \({G^{{\text{corr}}}}\) are added to the total energy to approximate \(\Delta {G_{{\text{aq}}}}\). Note that all energy terms of the direct scheme are computed within the implicit solvent model. The approximation made in the direct scheme is that gas phase contributions, i.e. gas phase geometries, are not needed.

Scheme V: vertical scheme

In contrast, the vertical scheme (Scheme V) uses the gas phase geometry and assumes that the free energy of the solute relaxing in the solution phase is negligible.

$${G^{\text{V}}}={E_{{\text{gas}}}}\left( {{\mathbf{R}_\mathbf{g}}} \right)+G_{{{\text{gas}}}}^{{{\text{corr}}}}\left( {{\mathbf{R}_\mathbf{g}}} \right)+\Delta {G_{\text{S}}}\left( {{\mathbf{R}_\mathbf{g}}} \right)$$
(5a)
$$\Delta {G_{\text{S}}}={E_{{\text{aq}}}}\left( {{\mathbf{R}_\mathbf{g}}} \right) - {E_{\text{gas}}}\left( {{\mathbf{R}_\mathbf{g}}} \right)$$
(5b)

In this expression, \(\Delta {G_{{\text{aq}}}}\) is calculated using the gas phase free energy and the solvation free energy \((\Delta {G_{\text{S}}}),\) which is the difference between the gas phase and solution phase total energies. Here, \({E_{{\text{aq}}}}\) is determined by employing the continuum solvation approach on the gas phase structure. Thermal corrections to the gas phase free energy \(G_{{{\text{gas}}}}^{{{\text{corr}}}}\) are used in this representation, as it is assumed that the thermal contributions in both phases are similar.

Scheme A: adiabatic scheme

The adiabatic scheme (Scheme A) considers both the gas and solution phase geometries.

$${G^{\text{A}}}={E_{{\text{gas}}}}\left( {{\mathbf{R}_\mathbf{g}}} \right)+G_{{{\text{gas}}}}^{{{\text{corr}}}}+\Delta {G_{\text{S}}}$$
(6a)
$$\Delta {G_{\text{S}}}={E_{{\text{aq}}}}\left( {{\mathbf{R}_\mathbf{l}}} \right) - {E_{\text{gas}}}\left( {{\mathbf{R}_\mathbf{g}}} \right)$$
(6b)

This scheme differs from the vertical scheme by the total energy contributions from the solute relaxed in solution, hence \({E_{{\text{aq}}}}\) is determined by optimizing the molecule in solution phase. The difference between the thermal contributions in gas phase and solution phase (relaxed) can be approximated by the difference in the adiabatic and direct scheme.

$$(\Delta \Delta G_{{{\text{D}} \to {\text{A}}}}^{{{\text{corr}}}}=\Delta G_{{{\text{gas}}}}^{{{\text{corr}}}} - \Delta G_{{{\text{aq}}}}^{{{\text{corr}}}};{\text{ }}\Delta {G^{{\text{D}} \to {\text{A}}}}={G^{\text{A}}} - {G^{\text{D}}})$$
(7)
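The three schemes differ only in which energy terms are combined, which can be made explicit in a short sketch. The container class, the field names, and the numerical values (in hartree) are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class MicrostateEnergies:
    e_gas_rg: float    # E_gas(R_g): gas phase energy at the gas phase geometry
    e_aq_rg: float     # E_aq(R_g): solution phase energy at the gas phase geometry
    e_aq_rl: float     # E_aq(R_l): solution phase energy at the solution geometry
    g_corr_gas: float  # thermal correction from the gas phase frequencies
    g_corr_aq: float   # thermal correction from the solution phase frequencies


def g_direct(m):
    """Scheme D (Eq. 4)."""
    return m.e_aq_rl + m.g_corr_aq


def g_vertical(m):
    """Scheme V (Eqs. 5a and 5b)."""
    dg_s = m.e_aq_rg - m.e_gas_rg
    return m.e_gas_rg + m.g_corr_gas + dg_s


def g_adiabatic(m):
    """Scheme A (Eqs. 6a and 6b)."""
    dg_s = m.e_aq_rl - m.e_gas_rg
    return m.e_gas_rg + m.g_corr_gas + dg_s


# Hypothetical energies for one microstate.
m = MicrostateEnergies(-500.100, -500.120, -500.123, 0.150, 0.149)
print(g_adiabatic(m) - g_direct(m))    # equals g_corr_gas - g_corr_aq (Eq. 7)
print(g_adiabatic(m) - g_vertical(m))  # equals E_aq(R_l) - E_aq(R_g)
```

Written this way, the adiabatic and direct schemes differ only by the thermal corrections, and the adiabatic and vertical schemes differ only by the relaxation energy of the solute in solution.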

Conventionally, the thermodynamic cycle is used to calculate the solution phase free energy when using continuum solvation models. The primary reason is that continuum solvation models are generally parameterized to produce accurate solvation free energies using lower levels of theory (HF or DFT with double-\(\zeta\) quality basis sets); however, by using the thermodynamic cycle the solution phase free energy can be determined at different levels of theory.

Inspired by the work of Ho [39], we consider modifications of each scheme in the hope of obtaining more accurate energetics by including single point energy corrections (denoted by “+S”) using larger basis sets (denoted by a superscript H). In the D + S Scheme, the total energy term in aqueous solution \({E_{{\text{aq}}}}\) is replaced with the total energy obtained with a larger basis set.

$${G^{{\text{D}}+{\text{S}}}}=E_{{{\text{aq}}}}^{{\text{H}}}\left( {{\mathbf{R}_\mathbf{l}}} \right)+G_{{{\text{aq}}}}^{{{\text{corr}}}}\left( {{\mathbf{R}_\mathbf{l}}} \right)$$
(8)

For the vertical and adiabatic schemes, the solvation free energies \((\Delta {G_\text{S}})\) are calculated with larger basis sets,

$${G^{{\text{V+S, A+S}}}}={E_{{\text{gas}}}}\left( {{\mathbf{R}_\mathbf{g}}} \right)+G_{{{\text{gas}}}}^{{{\text{corr}}}}\left( {{\mathbf{R}_\mathbf{g}}} \right)+\Delta G_{{\text{S}}}^{{\text{H}}}$$
(9a)
$$\Delta G_{{\text{S}}}^{{\text{H}}}=E_{{{\text{aq}}}}^{{\text{H}}}\left( {{\mathbf{R}_\mathbf{x}}} \right) - E_{{\text{gas}}}^{{\text{H}}}\left( {{\mathbf{R}_\mathbf{g}}} \right),{\text{ x}}={\text{l for Scheme A}},{\text{ x}}={\text{g for Scheme V}}$$
(9b)

As both approaches use the thermodynamic cycle, the V + S and A + S Schemes differ by the geometry \(({\mathbf{R}_\mathbf{x}})\) at which the aqueous phase total energies are determined.

$${G^{{\text{A}}+{\text{S}}}} - {G^{{\text{V}}+{\text{S}}}}=E_{{{\text{aq}}}}^{{\text{H}}}\left( {{\mathbf{R}_\mathbf{l}}} \right) - E_{{{\text{aq}}}}^{{\text{H}}}\left( {{\mathbf{R}_\mathbf{g}}} \right)$$
(10)
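The single point corrected schemes (Eqs. 8 to 10) can be sketched in the same way. The function and argument names are hypothetical, as are the energies (in hartree); the suffix `_h` marks quantities computed with the larger basis set.

```python
def g_direct_s(e_aq_h_rl, g_corr_aq):
    """Scheme D+S (Eq. 8): larger-basis energy plus the thermal correction."""
    return e_aq_h_rl + g_corr_aq


def g_cycle_s(e_gas_rg, g_corr_gas, e_aq_h_rx, e_gas_h_rg):
    """Schemes V+S and A+S (Eqs. 9a and 9b).

    e_aq_h_rx is E_aq^H at R_g for Scheme V+S and at R_l for Scheme A+S.
    """
    dg_s_h = e_aq_h_rx - e_gas_h_rg
    return e_gas_rg + g_corr_gas + dg_s_h


# Hypothetical energies for one microstate.
e_gas_rg, g_corr_gas = -500.100, 0.150
e_gas_h_rg = -500.300
e_aq_h_rg, e_aq_h_rl = -500.321, -500.325

g_vs = g_cycle_s(e_gas_rg, g_corr_gas, e_aq_h_rg, e_gas_h_rg)
g_as = g_cycle_s(e_gas_rg, g_corr_gas, e_aq_h_rl, e_gas_h_rg)
print(g_as - g_vs)  # equals E_aq^H(R_l) - E_aq^H(R_g), as in Eq. 10
```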

Microstate populations as a function of pH

To predict the fractional microstate populations at different pH values, we consider the following acid-dissociation reaction in which a microstate with charge n is transformed to a microstate with charge m upon a loss of (n-m) protons, where m < n.

$${({{\text{H}}_n}{\text{A}})^{n+}}\xrightarrow{{{K_a}(n|m)}}{({{\text{H}}_m}{\text{A}})^{m+}}+(n - m){{\text{H}}^+}; {K_a}\left( {n|m} \right) \equiv \frac{{\left[ {{{({{\text{H}}_m}{\text{A}})}^{m+}}} \right]{{\left[ {{{\text{H}}^+}} \right]}^{\left( {n - m} \right)}}}}{{\left[ {{{({{\text{H}}_n}{\text{A}})}^{n+}}} \right]}}$$
(11a)

By expressing the free energy of each microstate indexed with its respective charge, the expression for the equilibrium constant can be written as

$${K_a}\left( {n|m} \right)={\exp \left({ - \frac{{G(m)+(n - m)G({{\text{H}}^+}) - G(n)}}{{kT}}}\right)}$$
(11b)

Using these two expressions for the equilibrium constant,

$$\frac{{\left[ {{{({{\text{H}}_m}{\text{A}})}^{m+}}} \right]{{\left[ {{{\text{H}}^+}} \right]}^{\left( {n - m} \right)}}}}{{\left[ {{{({{\text{H}}_n}{\text{A}})}^{n+}}} \right]}}={\exp\left({ - \frac{{G(m)+(n - m)G({{\text{H}}^+}) - G(n)}}{{kT}}}\right)}$$
(12)

By separating the variables, we can define an expression for a single microstate (here using microstate n of charge n), in which

$$\frac{{\left[ {{{({{\text{H}}_m}{\text{A}})}^{m+}}} \right]}}{{{{\left[ {{{\text{H}}^+}} \right]}^m}{\exp\left({ - \frac{{G(m) - mG({{\text{H}}^+})}}{{kT}}}\right)}}}=\frac{{\left[ {{{({{\text{H}}_n}{\text{A}})}^{n+}}} \right]}}{{{{\left[ {{{\text{H}}^+}} \right]}^n}{\exp\left({ - \frac{{G(n) - nG({{\text{H}}^+})}}{{kT}}}\right)}}} \equiv \frac{{\left[ {{{({{\text{H}}_n}{\text{A}})}^{n+}}} \right]}}{{Q\left( n \right)}}$$
(13)

\(Q(n)\) is the partition function at a specified pH value and is defined as

$$Q(n) \equiv {\left[ {{{\text{H}}^+}} \right]^n}{\exp\left({ - \frac{{G(n) - nG({{\text{H}}^+})}}{{kT}}}\right)}$$
(14)

Therefore, the partition function for microstate A with charge nA is

$${Q_A}({n_A})={\exp\left({ - \frac{{{G_A}({n_A}) - {n_A}G({{\text{H}}^+})}}{{kT}} - {n_A}\ln (10){\text{pH}}}\right)}$$
(15)

Note that this partition function also holds when nA < 0.

As this generalized expression can be used for any microstate X with charge nX, the fractional population (PA) for microstate A with charge nA is obtained as

$${P_A}({n_A})=\frac{{{Q_A}({n_A})}}{{\mathop \sum \nolimits_{{X= \cdots ,A, \cdots }} {Q_X}({n_X})}}$$
(16)
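Equations 15 and 16 translate directly into a short routine. The sketch below is illustrative only: the microstate labels, their free energies, and the value used for G(H+) are hypothetical, and the largest exponent is subtracted before exponentiation purely for numerical stability.

```python
import math

KT = 1.987204e-3 * 298.15  # kT in kcal/mol at 298.15 K


def log_q(g, charge, ph, g_proton, kt=KT):
    """Natural logarithm of the partition function in Eq. 15."""
    return -(g - charge * g_proton) / kt - charge * math.log(10) * ph


def populations(microstates, ph, g_proton, kt=KT):
    """Fractional populations (Eq. 16) for a dict of name -> (G, charge)."""
    logs = {name: log_q(g, n, ph, g_proton, kt)
            for name, (g, n) in microstates.items()}
    shift = max(logs.values())  # avoids overflow in exp()
    weights = {name: math.exp(value - shift) for name, value in logs.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical two-state system constructed to have a microscopic pKa of 5.
G_PROTON = -270.29  # kcal/mol, assumed value for illustration
states = {
    "HA+": (0.0, 1),
    "A": (5.0 * KT * math.log(10) - G_PROTON, 0),
}
for ph in (2.0, 5.0, 8.0):
    print(ph, populations(states, ph, G_PROTON))
```

In this constructed example, the two microstates are equally populated at pH 5, and the protonated form dominates at lower pH.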

Macroscopic pKa values

To compute the macroscopic pKa values, we can use the expression for the microstate population to express the macroscopic equilibrium constant,

$$K_{a}^{{{\text{Macro}}}}\left( {n+1|n} \right)=\frac{{\left[ {{{\text{H}}^+}} \right]P(n)}}{{P(n+1)}}=\frac{{\left[ {{{\text{H}}^+}} \right]\sum\nolimits_{X} {{Q_X}({n_X}){\delta _{n,{n_X}}}} }}{{\mathop \sum \nolimits_{X} {Q_X}({n_X}){\delta _{(n+1),{n_X}}}}}=\frac{{{\exp\left({ - \frac{{G({{\text{H}}^+})}}{{kT}}}\right)}\mathop \sum \nolimits_{X} {\exp\left({ - \frac{{{G_X}({n_X})}}{{kT}}}\right)}{\delta _{n,{n_X}}}}}{{\mathop \sum \nolimits_{X} {\exp\left({ - \frac{{{G_X}({n_X})}}{{kT}}}\right)}{\delta _{(n+1),{n_X}}}}}$$
(17)

where \({\delta _{i,j}}\) is the Kronecker delta function. The macroscopic pKa between the microstates with a charge of n + 1 and the microstates with a charge of n is 

$$pK_{a}^{{{\text{Macro}}}}\left( {n+1|n} \right)= - \log \frac{{{\exp\left({ - \frac{{G({{\text{H}}^+})}}{{kT}}}\right)}\mathop \sum \nolimits_{X} {\exp\left({ - \frac{{{G_X}({n_X})}}{{kT}}}\right)}{\delta _{n,{n_X}}}}}{{\mathop \sum \nolimits_{X} {\exp\left({ - \frac{{{G_X}({n_X})}}{{kT}}}\right)}{\delta _{(n+1),{n_X}}}}}= - \log \frac{{\mathop \sum \nolimits_{X} {\exp\left({ - \frac{{{G_X}({n_X})}}{{kT}}}\right)}{\delta _{n,{n_X}}}}}{{\mathop \sum \nolimits_{X} {\exp\left({ - \frac{{{G_X}({n_X})}}{{kT}}}\right)}{\delta _{(n+1),{n_X}}}}}+\frac{{G\left( {{{\text{H}}^+}} \right)}}{{kT\ln 10}}$$
(18)
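A minimal sketch of Eq. 18 is given below. The function name, the representation of microstates as (G, charge) pairs, and the numerical values are assumptions made for illustration; the sums are evaluated relative to the lowest free energy in each charge state to keep the exponentials well behaved.

```python
import math

KT = 1.987204e-3 * 298.15  # kT in kcal/mol at 298.15 K


def macro_pka(microstates, n, g_proton, kt=KT):
    """Macroscopic pKa between charge states n + 1 and n (Eq. 18).

    microstates is a list of (G, charge) pairs with G in kcal/mol.
    """
    def log_sum(charge):
        gs = [g for g, c in microstates if c == charge]
        if not gs:
            raise ValueError(f"no microstates with charge {charge}")
        g_min = min(gs)
        return -g_min / kt + math.log(sum(math.exp(-(g - g_min) / kt) for g in gs))

    ln_ratio = log_sum(n) - log_sum(n + 1)
    return -ln_ratio / math.log(10) + g_proton / (kt * math.log(10))


# Hypothetical example: one cation and two degenerate neutral tautomers.
G_PROTON = -270.29  # kcal/mol, assumed value for illustration
g_neutral = 5.0 * KT * math.log(10) - G_PROTON
print(macro_pka([(0.0, 1), (g_neutral, 0)], 0, G_PROTON))
print(macro_pka([(0.0, 1), (g_neutral, 0), (g_neutral, 0)], 0, G_PROTON))
```

With a single microstate per charge, the expression reduces to the microscopic pKa. Adding a second, degenerate deprotonated tautomer lowers the macroscopic pKa by log 2, or about 0.30 units.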

QM calculations

The initial structures of the 352 microstates were generated from the SMILES strings provided by the SAMPL6 pKa challenge using Open Babel 2.4.1 [40]. Gas phase and solution phase geometry optimizations were performed using the M06-2X density functional [41]. As charged and uncharged species are represented in the molecule set, the 6-31G(d) basis set [42] was used for cationic species, whereas additional diffuse functions (6-31+G(d) [43]) were included for the anionic microstates. All QM optimizations were performed with “tight” wave function and geometry convergence criteria and an “ultrafine” numerical quadrature grid, as required by the M06-2X functional.

To maintain consistency of the basis sets between microstate reaction pairs, duplicate calculations are carried out for neutral species using each basis set (Table 2).

Frequencies were examined to confirm stationary points and scaled by 0.9465 and 0.9500 for methods using the 6-31G(d) and 6-31+G(d) basis sets, respectively [44]. Additional single point energy calculations for each microstate were performed using M06-2X in conjunction with 6-311G(d,p) and 6-311++G(d,p) to serve as corrections to the respective double-\(\zeta\) basis sets. Solution phase geometry optimizations and single point calculations were carried out using the SMD implicit solvation model [32]. All calculations were performed in Gaussian 16 (Rev. A.03) [45] using an ultrafine integration grid. To improve conformational sampling, two different algorithms were considered. Per microstate, ten low-energy conformers were generated both stochastically and systematically using the MOE software [46] and compared against the optimized structures of each microstate. For microstates in which there was a large difference in the conformation, the new conformers were subjected to the aforementioned workflow.
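The basis set assignment described above can be summarized in a short sketch. The helper names are hypothetical, and the route line is only an assumption of how the stated settings (M06-2X, tight convergence, ultrafine grid, SMD water) could be expressed with standard Gaussian keywords; the exact input files are not reproduced here.

```python
# Frequency scale factors and single point basis sets quoted in the text.
FREQ_SCALE = {"6-31G(d)": 0.9465, "6-31+G(d)": 0.9500}
SINGLE_POINT_BASIS = {"6-31G(d)": "6-311G(d,p)", "6-31+G(d)": "6-311++G(d,p)"}


def optimization_basis_sets(charge):
    """Basis sets used to optimize a microstate of the given net charge."""
    if charge > 0:
        return ["6-31G(d)"]
    if charge < 0:
        return ["6-31+G(d)"]
    # Neutral species are computed with both basis sets so that each
    # reaction pair is evaluated with a consistent basis.
    return ["6-31G(d)", "6-31+G(d)"]


def route_line(basis, solvated):
    """Assumed Gaussian route line for an optimization plus frequencies."""
    parts = ["#", f"M062X/{basis}", "Opt=Tight", "Freq",
             "SCF=Tight", "Int=UltraFine"]
    if solvated:
        parts.append("SCRF=(SMD,Solvent=Water)")
    return " ".join(parts)


for basis in optimization_basis_sets(0):
    print(route_line(basis, solvated=True), "| scale", FREQ_SCALE[basis],
          "| single point", SINGLE_POINT_BASIS[basis])
```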

Results and discussion

Our method of using two basis sets is similar to the mixed basis set method, in which diffuse functions are added at the reactive center to improve the modeling of anionic species [47]. We do not adopt mixed basis sets because the excess electron is assumed to be delocalized over the entire molecule rather than localized on the deprotonated atom.

Errors in pKa calculations arise from the reaction scheme by which the aqueous free energy is approximated. In this challenge, we considered several approaches for predicting absolute pKa values that differ in how the free energy contributions in the gas phase and solution phase are determined. Our submissions for Type I, Type II, and Type III predictions, per scheme, are listed in Table 1. Note that the calculated pKa values are reported without the standard error of the mean (SEM).

Table 1 SAMPL6 submission IDs for our approaches

Direct scheme

In the direct scheme, the aqueous free energy is determined only by solution phase calculations, avoiding the thermodynamic cycle. This is an attractive approach as it requires only two calculations per microstate pair and already accounts for solvent-induced effects, since the geometries are optimized in the solution phase. From the results shown in Table 2, overall, the direct approach predicts pKa values within a mean absolute deviation (MAD) of 1.36 pKa units from experiment. Some of the major outliers include SM01, SM06, SM14, and SM23. SM18 and SM23 suffer from the hydrogen bonding effect. These molecules can form stronger hydrogen bond interactions with their functional group (the hydroxyl group of phenol or the amino group of aniline), which is reflected in the macroscopic pKa. Other molecules can also suffer from the hydrogen bonding effect, but less significantly, because the hydrogen bonds being formed are much weaker. Some of the conformations were biased, as the implicit solvation model cannot account for hydrogen bonding effectively.

Table 2 Basis sets selection per sub-challenge

A previous study comparing the accuracy of the direct scheme with a low (MP2) and high (G3) level of theory reported that use of a higher level of theory improves the MAD with respect to experiment for carboxylic, inorganic, and cationic acids using the direct scheme from 0.4 to 0.9 pKa units [39]. Rather than using a different method, we consider improving the quality of the basis set to represent a better level of theory for this challenge. In most cases, adding basis functions yields poorer predictions, differing by as much as 5.0 pKa units from the direct scheme. The exceptions are SM04, SM07, SM20, SM22, and SM24, for which using a larger basis set yields predictions that are on average 0.5 pKa units closer to experiment (a difference of 1.3 pKa units for SM20).

Vertical scheme

The vertical scheme utilizes gas phase geometries and the thermodynamic cycle to approximate the free energy of solvation. Contrasting the direct and vertical schemes highlights the difference between the gas phase and solution phase contributions to the solvation free energy. Overall, the vertical scheme overestimates the pKa values, yielding an MAE of 1.74 pKa units. This is greater than the MAE for the direct method; the difference of 0.38 pKa units corresponds to a free energy difference of about 0.5 kcal/mol attributable to the difference in the geometries. Compared to the direct scheme, the vertical scheme overestimates the pKa for SM06 and SM09. This poorer performance of the vertical scheme is surprising, as this approach is similar to the way in which continuum solvation models are parameterized.

As the vertical scheme assumes the gas phase geometry, it works well for small or rigid molecules (e.g. SM02, SM05, and SM09). We also consider using larger basis sets for the solvation free energy term (Eq. 3c). In most cases, the inclusion of triple-\(\zeta\) basis sets improves the predictions by an average of 0.1–0.2 pKa units with respect to experiment. Cases that do not follow this trend (in which the larger basis set yields predictions greater than those obtained using smaller basis sets) occur for polyprotic molecules, such as SM15 and SM22.

Adiabatic scheme

Considering both optimized gas phase and solution phase structures is hypothesized to provide more accurate pKa predictions, as it includes the energetic compensation for relaxing in solvent. The adiabatic scheme yields pKa values with an MAE of 1.26 pKa units. Comparing the two thermodynamic cycle-based approaches, the adiabatic scheme provides more accurate pKa values than the vertical scheme for 64% of the molecules. This highlights that the structures determined in both the gas phase and the solution phase are significant for determining pKa values.

As with the direct and vertical schemes, we examine how using a larger basis set impacts the solvation free energy. The results indicate that using a triple-\(\zeta\)-level basis set for the solvation free energy term improves the pKa predictions by an average of 0.2 pKa units.

Comparison of the schemes

Overall, the results in Table 3 illustrate a hierarchy of the different reaction schemes. Contrasting the three schemes, pKa values determined via the direct scheme and adiabatic scheme are closer to experiment than those predicted using the vertical scheme. However, this relationship holds only at the level of theory employed for each reaction scheme (in this case, M06-2X with a double-\(\zeta\) level basis set). When a larger basis set is applied to the solvation free energy term, the adiabatic and vertical schemes have less error with respect to experiment (MAE of 1.10 and 1.48, respectively) than the direct scheme (MAE of 1.95).

Table 3 Absolute macroscopic pKa values via the direct (D), vertical (V), and adiabatic (A) schemes

Our submissions to the SAMPL6 challenge (Table 1) did not include the standard state correction (which makes a difference of 1.39 pKa units) and also used a value for the solvation free energy of the proton that is not recommended (a difference of 0.22 pKa units); both issues have been corrected here. These results are encouraging, as the pKa predictions via the adiabatic and direct schemes correlate well with experiment, having a correlation coefficient greater than 0.9.
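For reference, the error statistics quoted throughout this section can be computed as in the sketch below. The function names and the example values are hypothetical and are not taken from Table 3; the squared Pearson correlation is used here as one common definition of R2.

```python
import math


def mae(calc, expt):
    """Mean absolute error between calculated and experimental pKa values."""
    return sum(abs(c - e) for c, e in zip(calc, expt)) / len(calc)


def rmse(calc, expt):
    """Root mean square error."""
    return math.sqrt(sum((c - e) ** 2 for c, e in zip(calc, expt)) / len(calc))


def r_squared(calc, expt):
    """Squared Pearson correlation coefficient."""
    n = len(calc)
    mean_c, mean_e = sum(calc) / n, sum(expt) / n
    cov = sum((c - mean_c) * (e - mean_e) for c, e in zip(calc, expt))
    var_c = sum((c - mean_c) ** 2 for c in calc)
    var_e = sum((e - mean_e) ** 2 for e in expt)
    return cov ** 2 / (var_c * var_e)


# Hypothetical calculated and experimental pKa values.
calc = [3.1, 5.4, 7.9, 10.2]
expt = [2.5, 4.9, 6.8, 9.5]
print(mae(calc, expt), rmse(calc, expt), r_squared(calc, expt))
```

A systematic offset between calculated and experimental values increases the MAE and RMSE but leaves the correlation unchanged, which is why a high correlation alone does not imply accurate absolute pKa values.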

To confirm whether the approach predicts the proper chemistry, we evaluate the different schemes on a small subset of molecules that share a similar scaffold, differing by electron-donating or electron-withdrawing groups. The molecules SM02, SM04, SM07, SM09, SM12, and SM13 share the 4-aminoquinazoline scaffold. Ranked by acidity, SM02, SM12, and SM09 differ by substituents on the phenyl ring, spanning a range of 0.35 pKa units. The direct schemes are unable to properly determine the trend, as the predictions indicate that SM12 is more acidic than SM02 (SM12 has a Ph-Cl whereas SM02 has a Ph-CF3). In contrast, the vertical schemes rank the acidities of SM02 and SM12 correctly but overestimate the acidity of SM09. This is believed to result from using the gas phase geometry, as only one low energy conformation was considered and the more probable conformations that more closely resemble the structure in the solution phase were neglected. SM13 has a larger pKa and differs in that it contains electron-donating groups on the quinazoline as opposed to the amino group. The direct and vertical schemes overestimate the acidity relative to SM02, SM12, and SM09. The difference between SM04 and SM07 is small, both quantitatively and qualitatively (0.04 pKa units). Interestingly, only the direct scheme was able to properly rank the acidities for these molecules. We also compare the microscopic pKa values with respect to experiment for these molecules and observe the same trends (Table S6).

Room for improvement

Aside from the chosen level of theory, another source of error arises from the lack of explicit interactions between the solute and water, which are not accounted for in continuum solvation models. For example, functional groups such as alcohols and phenols have ionic states that may be stabilized in solution by hydrogen bonding. In general, neglecting these interactions could result in overestimation or underestimation of pKa values for acids and bases. Including explicit water molecules with continuum solvation models, also termed microsolvation or cluster-continuum modeling, has been shown to improve pKa predictions for such issues [48].

For example, the pKa values for molecules SM01, SM15, and SM22, which may undergo deprotonation at the phenol group, were overestimated by 1.3–5.0 pKa units. As a proof of concept, we tried to improve pKa predictions for SM01 by adding water molecules near the hydroxyl group. Adding one water molecule improves the prediction of the pKa by an average of 1.3 pKa units (Fig. 3). By saturating the hydroxyl group with three water molecules, the pKa improves by an average of 3.0 pKa units (Table S3).

Fig. 3 Effect of microsolvation on pKa calculation schemes for SM01

Relative schemes

When employing the different calculation schemes for this challenge, we only considered predicting absolute pKa values as opposed to relative pKa values. Relative schemes for calculating pKa use empirical parameters to scale or offset the solution phase free energy.

$${\text{p}}{K_a}=A\frac{{\Delta {G_{{\text{aq}}}}}}{{RT\ln 10}}+B$$
(19)

Using a relative scheme as an offset (A = 1) to the free energy entails identifying and applying (subjectively) good reference models, which relies on chemical intuition. As this challenge includes 620 unique acid–base pairs, identifying the proper reference models proved difficult since the molecules had multiple protonation sites. Alternatively, a linear regression fit can be applied to the calculated solution phase free energy to correct for systematic errors (e.g. concentration of water, proton solvation free energy, model chemistry, etc.). As this is a popular approach for calculating pKa [47, 49], we consider applying a linear regression correction to each scheme. To determine the parameters A and B, two training sets, consisting of 63 acids (Table S2) and 56 bases (Table S3), were used. The linear fitting parameters determined for each scheme can be found in the Supporting Information (Table S4). For each scheme, while applying a linear regression fit does not improve the correlation \((\Delta {\text{R}^2}= \pm 0.01),\) this approach does improve the pKa predictions, with a lower MAE and RMSE than the respective absolute calculation pKa schemes (Table 4).
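The linear correction of Eq. 19 amounts to an ordinary least-squares fit of experimental pKa values against calculated ones. The sketch below is illustrative only: the function names and the training values are hypothetical and do not correspond to the parameters in Table S4.

```python
def fit_linear(calc, expt):
    """Least-squares slope A and intercept B for expt = A * calc + B."""
    n = len(calc)
    mean_x, mean_y = sum(calc) / n, sum(expt) / n
    sxx = sum((x - mean_x) ** 2 for x in calc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calc, expt))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def apply_correction(calc_pka, slope, intercept):
    """Corrected pKa following Eq. 19."""
    return slope * calc_pka + intercept


# Hypothetical training set of calculated and experimental pKa values.
train_calc = [2.0, 5.0, 8.0, 12.0]
train_expt = [2.4, 4.6, 6.9, 9.8]
A, B = fit_linear(train_calc, train_expt)
print(A, B, apply_correction(6.0, A, B))
```

Because the correction is linear, it changes the MAE and RMSE of the predictions but cannot change their correlation with experiment.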

Table 4 Comparison of linear regression fit macroscopic pKa values via direct (D), vertical (V), and adiabatic (A) schemes with experiment

We believe that the slope of the experimental pKa vs calculated pKa deviates from the expected value of 1 because of the hydrogen bonding effect. Since hydrogen bond interactions can stabilize the charged species while having little effect on the neutral species, the pKa values for the bases are usually underestimated while those for the acids are usually overestimated when explicit hydrogen bonds between the solvent and solute are not considered. The slope can approach the expected value of 1 when explicit waters are included [48].

Multiple minima consideration

All pKa values have been determined using one conformation per microstate pair. The molecules within the SAMPL6 pKa data set are not rigid (excluding SM01 and SM22) and can adopt multiple conformations that correspond to local minima. To probe whether the exclusion of multiple minima was a source of error in our pKa calculations, we generated 6 to 32 different conformations for each microstate of SM06 and re-calculated the macroscopic pKa by sequentially including the lowest energy conformations per microstate. As shown in Table 5, including multiple minima has little impact on the pKa prediction (0.1–0.6 pKa units). When the linear regression fit is applied, the pKa predictions are closer to experiment using one conformation per microstate. Including additional conformations per microstate yields a maximum difference of 0.3 pKa units (Table S5).
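One way to see why additional conformers have a limited effect is to treat them as extra terms in the Boltzmann sums of Eq. 18, which is equivalent to replacing the free energy of a microstate by an ensemble free energy. This treatment is an assumption made for this sketch, and the conformer free energies below are hypothetical.

```python
import math

KT = 1.987204e-3 * 298.15  # kT in kcal/mol at 298.15 K


def ensemble_free_energy(conformer_g, kt=KT):
    """G = -kT ln(sum_i exp(-G_i / kT)) over the conformers of one microstate."""
    g_min = min(conformer_g)
    boltzmann_sum = sum(math.exp(-(g - g_min) / kt) for g in conformer_g)
    return g_min - kt * math.log(boltzmann_sum)


# Hypothetical relative conformer free energies (kcal/mol), lowest first.
conformers = [0.0, 0.8, 1.5, 3.0]
for count in range(1, len(conformers) + 1):
    g = ensemble_free_energy(conformers[:count])
    shift = g / (KT * math.log(10))  # change of this microstate's G in pKa units
    print(count, round(g, 3), round(shift, 3))
```

Even a second conformer that is degenerate with the first lowers the free energy by only kT ln 2, or about 0.3 pKa units, and the shifts of the protonated and deprotonated microstates partly cancel in the pKa.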

Table 5 Macroscopic pKa values of SM06 determined as a function of the number of conformations per microstate

Conclusion

In this study, three calculation schemes were used to predict the pKa of molecules as part of the SAMPL6 challenge. The adiabatic scheme yields more accurate pKa predictions than the direct and vertical schemes. Using a larger basis set with the adiabatic scheme yields the best results among all schemes, with an RMSE of 1.40 pKa units. A combination of popular and inexpensive methods (M06-2X/Pople basis sets (6-31G(d)/6-311G(d,p) or 6-31+G(d)/6-311++G(d,p))//SMD) was used in our approach, which means that it can be carried out in most popular software packages. Without additional parameterization, we obtain a very encouraging result, with an \({\text{R}^2}\) of 0.93, by using different basis sets for different charged species. If a linear regression fit is applied, the pKa predictions improve further (RMSE of 0.73 and \({\text{R}^2}\) of 0.94). This approach can be improved further, as there are still multiple sources of error from the electronic structure method, basis set, and solvation model.