Introduction

The acid dissociation constant (Ka) describes the protonation state equilibrium of a molecule given pH. More commonly, we refer to \(\text {p}K_{\text{a}}\) \( = -\log _{10} K_a\), its negative logarithmic form. Predicting \(\text {p}K_{\text{a}}\) is a prerequisite for predicting many other properties of small molecules such as their protein binding affinity, distribution coefficient (log D), membrane permeability, and solubility. As a major aim of computer-aided drug design (CADD) is to aid in the assessment of pharmaceutical and physicochemical properties of virtual molecules prior to synthesis to guide decision-making, accurate computational \(\text {p}K_{\text{a}}\) predictions are required in order to accurately model numerous properties of interest to drug discovery programs.

Ionizable sites are found often in drug molecules and influence their pharmaceutical properties including target affinity, ADME/Tox, and formulation properties [1]. It has been reported that most drugs are ionized in the range of 60–90% at physiological pH [2]. Drug molecules with titratable groups can exist in many different charge and protonation states based on the pH of the environment. Given that experimental data of protonation states and \(\text {p}K_{\text{a}}\) are often not available, we rely on predicted \(\text {p}K_{\text{a}}\) values to determine which charge and protonation states the molecules populate and the relative populations of these states, so that we can assign the appropriate dominant protonation state(s) in fixed-state calculations or the appropriate solvent state weights/protonation penalty to calculations considering multiple states.

The pH of the human gut ranges between 1 and 8, and 74% of approved drugs can change ionization state within this physiological pH range [3]. Because of this, \(\text {p}K_{\text{a}}\) values of drug molecules provide essential information about their physicochemical and pharmaceutical properties. A wide distribution of acidic and basic \(\text {p}K_{\text{a}}\) values, ranging from 0 to 12, has been observed in approved drugs [1, 3].

Drug-like molecules present difficulties for \(\text {p}K_{\text{a}}\) prediction compared with simple monoprotic molecules. Drug-like molecules are frequently multiprotic, have large conjugated systems, often contain heterocycles, and can tautomerize. In addition, drug-like molecules with significant conformational flexibility can form intramolecular hydrogen bonding, so that conformational changes can significantly shift their \(\text {p}K_{\text{a}}\) values. This presents further challenges for modeling methods, where deficiencies in solvation models may mispredict the propensity for intramolecular hydrogen bond formation.

Predicting \(\text {p}K_{\text{a}}\)s of drug-like molecules accurately is a prerequisite for computational drug discovery and design. Small molecule \(\text {p}K_{\text{a}}\) predictions can influence computational protein–ligand binding affinities in multiple ways. Errors in \(\text {p}K_{\text{a}}\) predictions can lead to modeling the wrong charge and tautomerization states, which affects the hydrogen bonding opportunities and charge distribution of the ligand. The dominant protonation state and relative populations of minor states in aqueous medium are dictated by the molecule’s \(\text {p}K_{\text{a}}\) values. The relative free energy of different protonation states in the aqueous state is a function of pH, and contributes to the overall protein–ligand affinity in the form of a free energy penalty for populating higher energy protonation states [4]. Any error in predicting the free energy of a minor aqueous protonation state of a ligand that dominates the complex binding free energy will directly add to the error in the predicted binding free energy, and selecting the incorrect dominant protonation state altogether can lead to even larger modeling errors. Similarly, for log D predictions, an inaccurate prediction of protonation states and their relative free energies will be detrimental to the accuracy of transfer free energy predictions.

For a monoprotic weak acid (HA) or base (B)—whose dissociation equilibria are shown in Eq. 1—the acid dissociation constant is expressed as in Eq. 2, or, commonly, in its negative base-10 logarithmic form as in Eq. 3. The ratio of ionization states can be calculated with Henderson–Hasselbalch equations shown in Eq. 4.

$$\begin{aligned}&HA \rightleftharpoons A^- + H^+;\quad BH^+ \rightleftharpoons B + H^+ \end{aligned}$$
(1)
$$\begin{aligned}&K_a = \frac{[A^-][H^+]}{[HA]};\quad K_a = \frac{[B][H^+]}{[BH^+]} \end{aligned}$$
(2)
$$\begin{aligned}&pK_a = -\log _{10}{K_a} \end{aligned}$$
(3)
$$\begin{aligned}&pH = pK_a + \log _{10}{\frac{[A^-]}{[HA]}};\quad pH = pK_a + \log _{10}{\frac{[B]}{[BH^+]}} \end{aligned}$$
(4)
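
As a small illustration of Eq. 4, the following Python sketch computes the fraction of a monoprotic species that is deprotonated or ionized at a given pH. The function names and the `is_acid` flag are hypothetical and introduced here only for illustration; the sketch assumes ideal behavior (activities equal to concentrations).

```python
def fraction_deprotonated(pka: float, ph: float) -> float:
    """Fraction of a monoprotic species in its deprotonated form (A- or B).

    Eq. 4 rearranged: [deprotonated]/[protonated] = 10**(pH - pKa).
    """
    ratio = 10.0 ** (ph - pka)
    return ratio / (1.0 + ratio)


def fraction_ionized(pka: float, ph: float, is_acid: bool) -> float:
    """Fraction carrying a charge: A- for an acid (HA), BH+ for a base (B)."""
    deprotonated = fraction_deprotonated(pka, ph)
    # Acids are charged when deprotonated; bases are charged when protonated.
    return deprotonated if is_acid else 1.0 - deprotonated


if __name__ == "__main__":
    # Hypothetical example values: an acid with pKa 4.4 and a base with pKa 9.4 at pH 7.4.
    print(fraction_ionized(4.4, 7.4, is_acid=True))
    print(fraction_ionized(9.4, 7.4, is_acid=False))
```

When pH equals the \(\text {p}K_{\text{a}}\), both forms are equally populated, which is a convenient sanity check for any implementation.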

For multiprotic molecules, the definition of \(\text {p}K_{\text{a}}\) diverges into macroscopic \(\text {p}K_{\text{a}}\) and microscopic \(\text {p}K_{\text{a}}\) [5,6,7]. Macroscopic \(\text {p}K_{\text{a}}\) describes the equilibrium dissociation constant between different charged states of the molecule. Each charge state can be composed of multiple tautomers. Macroscopic \(\text {p}K_{\text{a}}\) thus determines the deprotonation of the molecule, rather than the location of the titratable group. A microscopic \(\text {p}K_{\text{a}}\) describes the acid dissociation equilibrium between individual tautomeric states of different charges. (There is no \(\text {p}K_{\text{a}}\) defined between tautomers of the same charge as they have the same number of protons and their relative populations are independent of pH.) The microscopic \(\text {p}K_{\text{a}}\) determines the identity and population distribution of tautomers within each charge state. Thus, each macroscopic charge state of a molecule can be composed of multiple microscopic tautomeric states. The microscopic \(\text {p}K_{\text{a}}\) value defined between two microstates captures the deprotonation of a single titratable group with other titratable groups held in a fixed background protonation state. In molecules with multiple titratable groups, the protonation state of one group can affect the proton dissociation propensity of another functional group, therefore the same titratable group may have different proton affinities (microscopic \(\text {p}K_{\text{a}}\) values) based on the protonation state of the rest of the molecule.

Different experimental methods are sensitive to changes in the total charge or the location of individual protons, so they measure different definitions of \(\text {p}K_{\text{a}}\)s, as explained in more detail in prior work [8]. Most common \(\text {p}K_{\text{a}}\) measurement techniques such as potentiometric and spectrophotometric methods measure macroscopic \(\text {p}K_{\text{a}}\)s, while NMR measurements can determine microscopic \(\text {p}K_{\text{a}}\)s by measuring microstate (tautomer) populations with respect to pH. Therefore, it is important to pay attention to the source and definition of \(\text {p}K_{\text{a}}\) values in order to correctly interpret their meaning.

Many computational methods can predict both microscopic and macroscopic \(\text {p}K_{\text{a}}\)s. While experimental measurements more often provide only macroscopic \(\text {p}K_{\text{a}}\)s, microscopic \(\text {p}K_{\text{a}}\) predictions are more informative for determining relevant microstates (microscopic protonation states and tautomers) of a molecule and their relative free energies. Predicted microstate populations can be converted to predicted macroscopic \(\text {p}K_{\text{a}}\)s for direct comparison with experimentally obtained macroscopic \(\text {p}K_{\text{a}}\)s. In this paper, we explore approaches to assess the performance of both macroscopic and microscopic \(\text {p}K_{\text{a}}\) predictions, taking advantage of available experimental data.

Microscopic \(\text {p}K_{\text{a}}\) predictions can be converted to macroscopic \(\text {p}K_{\text{a}}\) predictions either directly with Eq. 5 [9],

$$\begin{aligned} K_a^\text {macro} = \sum _{j=1}^{N_\text {deprot}} \frac{1}{\sum _{i=1}^{N_\text {prot}}\frac{1}{ K_{ij}^\text {micro}}} \quad , \end{aligned}$$
(5)

or through computing the macroscopic free energy of deprotonation between ionization states with charges N and \(N-1\) via Boltzmann-weighted sum of the relative free energy of microstates (\(G_i\)) as in Eqs. 6 and 7 [10].

$$\Delta G_{N-1, N} = RT \ln {\frac{\sum _i e^{-G_i/RT} \delta _{N_i, N-1}}{\sum _i e^{-G_i/RT} \delta _{N_i, N}}}$$
(6)
$$pK_a= pH - \frac{\Delta G_{N-1, N}}{RT \ln {10}}$$
(7)

In Eq. 6, \(\Delta G_{N-1, N}\) is the effective macroscopic protonation free energy. \(\delta _{N_i, N-1}\) is equal to unity when the microstate i has a total charge of \(N-1\) and zero otherwise, and \(\delta _{N_i, N}\) is defined analogously for charge N. RT is the ideal gas constant times the absolute temperature.
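
The two conversion routes in Eqs. 5–7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in the challenge: the function names, the nested-list layout of the microscopic \(\text {p}K_{\text{a}}\) values, and the default RT value (kcal/mol at 298.15 K) are choices made here. Eq. 5 also presumes that the supplied microscopic \(\text {p}K_{\text{a}}\) values are thermodynamically consistent with each other.

```python
import math

# Assumed value: gas constant (kcal/(mol*K)) times 298.15 K.
RT_KCAL_PER_MOL = 0.0019872041 * 298.15


def macro_pka_from_micro(micro_pkas):
    """Eq. 5: micro_pkas[i][j] is the microscopic pKa between protonated
    microstate i and deprotonated microstate j of adjacent charge states."""
    n_prot = len(micro_pkas)
    n_deprot = len(micro_pkas[0])
    k_macro = 0.0
    for j in range(n_deprot):
        # 1 / K_ij equals 10**pKa_ij
        inverse_sum = sum(10.0 ** micro_pkas[i][j] for i in range(n_prot))
        k_macro += 1.0 / inverse_sum
    return -math.log10(k_macro)


def macro_pka_from_free_energies(g_charge_n, g_charge_n_minus_1, ph,
                                 rt=RT_KCAL_PER_MOL):
    """Eqs. 6-7: relative microstate free energies (same units as rt),
    evaluated at the given pH, for charge N and charge N-1 microstates."""
    z_n = sum(math.exp(-g / rt) for g in g_charge_n)
    z_n_minus_1 = sum(math.exp(-g / rt) for g in g_charge_n_minus_1)
    delta_g = rt * math.log(z_n_minus_1 / z_n)        # Eq. 6
    return ph - delta_g / (rt * math.log(10.0))       # Eq. 7


if __name__ == "__main__":
    # Hypothetical case: one protonated microstate, two deprotonated
    # tautomers, each with a microscopic pKa of 5.
    print(macro_pka_from_micro([[5.0, 5.0]]))
    # Same system via free energies: at pH 5 all three microstates are
    # equally populated, so their relative free energies are equal.
    print(macro_pka_from_free_energies([0.0], [0.0, 0.0], ph=5.0))
```

In this hypothetical case both routes give the same macroscopic value, 5 − log10(2), which reflects that having two equivalent deprotonated tautomers makes the molecule slightly more acidic than either microscopic \(\text {p}K_{\text{a}}\) suggests.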

Motivation for a blind \(\text {p}K_{\text{a}}\) challenge

SAMPL (Statistical Assessment of the Modeling of Proteins and Ligands) is a series of annual computational prediction challenges for the computational chemistry community. The goal of the SAMPL community is to evaluate the current performance of computational models and to focus the attention of the quantitative biomolecular modeling field on problems that limit the accuracy of protein–ligand binding models. SAMPL Challenges aim to help computer-aided drug discovery make sustained progress toward higher accuracy by focusing the community on one isolated accuracy-limiting problem at a time. By conducting a series of blind challenges, which often feature the computation of specific physical properties critical for protein–ligand modeling, and encouraging rapid sharing of lessons learned, SAMPL aims to accelerate progress toward quantitative accuracy in modeling.

SAMPL Challenges that focus on physical properties have assessed intermolecular binding models of various protein–ligand and host–guest systems, as well as the prediction of hydration free energies and distribution coefficients to date. These blind challenges motivate improvements in computational methods by revealing unexpected sources of error, identifying features of methods that perform well or poorly, and enabling the participants to share information after each successive challenge. Previous SAMPL Challenges have focused on the limitations of force field accuracy, finite sampling, solvation modeling defects, and tautomer/protonation state predictions on protein–ligand binding predictions.

During the SAMPL5 log D Challenge, the performance of models in predicting cyclohexane-water log D was worse than expected—accuracy suffered when protonation states and tautomers were not taken into account [11, 12]. Many participants simply submitted log P predictions as if they were equivalent to log D, and many were not prepared to account for the contributions of different ionization states to the distribution coefficient in their models. Challenge results highlighted that log P predictions were not an accurate approximation of log D without capturing protonation state effects. The calculations were improved by including the free energy penalty of the neutral state which relies on obtaining an accurate \(\text {p}K_{\text{a}}\) prediction [11]. With the goal of deconvoluting the different sources of error contributing to the large errors observed in the SAMPL5 log D Challenge, we organized separate \(\text {p}K_{\text{a}}\) and log P challenges in SAMPL6 [8, 13, 14]. For this iteration of the SAMPL challenge, we isolated the problem of predicting aqueous protonation states and associated \(\text {p}K_{\text{a}}\) values.

This is the first time a blind \(\text {p}K_{\text{a}}\) prediction challenge has been fielded as part of SAMPL. In this challenge, we aimed to assess the performance of current \(\text {p}K_{\text{a}}\) prediction methods for drug-like molecules, investigate potential causes of inaccurate \(\text {p}K_{\text{a}}\) estimates, and determine how the current level of accuracy of these models might impact the ability to make quantitative predictions of protein–ligand binding affinities.

Approaches to predict small molecule \(\text {p}K_{\text{a}}\)s

A large variety of methods have been developed for predicting aqueous \(\text {p}K_{\text{a}}\)s of small molecules. Broadly, we can divide \(\text {p}K_{\text{a}}\) prediction methods into knowledge-based empirical methods and physical methods. Empirical methods include the following categories: Database Lookup (DL) [15], Linear Free Energy Relationship (LFER) [16,17,18], Quantitative Structure-Property Relationship (QSPR) [19,20,21,22], and Machine Learning (ML) approaches [23, 24]. DL methods rely on the principle that structurally similar compounds have similar \(\text {p}K_{\text{a}}\) values and utilize an experimental database of complete structures or fragments. The \(\text {p}K_{\text{a}}\) value of the most similar database entry is reported as the predicted \(\text {p}K_{\text{a}}\) of the query molecule. In the QSPR approach, the \(\text {p}K_{\text{a}}\) values are predicted as a function of various quantitative molecular descriptors, and the parameters of the function are trained on experimental datasets. A function in the form of multiple linear regression is common, although more complex forms can also be used, such as the artificial neural networks in ML methods. The LFER approach is the oldest \(\text {p}K_{\text{a}}\) prediction strategy. It uses Hammett–Taft type equations to predict \(\text {p}K_{\text{a}}\) based on classification of the molecule to a parent class (associated with a base \(\text {p}K_{\text{a}}\) value) and two parameters that describe how the base \(\text {p}K_{\text{a}}\) value must be modified given its substituents. Physical modeling of \(\text {p}K_{\text{a}}\) predictions requires Quantum Mechanics (QM) models. QM methods are often utilized together with linear empirical corrections (LEC) that are designed to rescale and unbias QM predictions for better accuracy.
Classical molecular mechanics-based \(\text {p}K_{\text{a}}\) prediction methods are not feasible as deprotonation is a covalent bond breaking event that can only be captured by QM. Constant-pH molecular dynamics methods can calculate \(\text {p}K_{\text{a}}\) shifts of multiple titratable groups in large biomolecular systems where there is a low degree of coupling between protonation sites and linear summation of protonation energies (initially determined in a reference solvent) can be assumed [25]. However, this approach cannot generally be applied to small organic molecules due to the high degree of coupling between protonation sites [26,27,28].
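
As a rough illustration of the LFER idea described above, a Hammett-type relationship is commonly written in the general form below. The symbols are assumptions introduced here rather than notation from the cited methods: \(pK_a^{0}\) is the base value of the parent class, \(\rho\) is the reaction constant of that class, and \(\sigma _i\) are the substituent constants.

$$\begin{aligned} pK_a = pK_a^{0} - \rho \sum _i \sigma _i \end{aligned}$$

Individual LFER implementations may differ in how parent classes are defined and how substituent effects are combined.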

Methods

Design and logistics of the SAMPL6 \(\text {p}K_{\text{a}}\) Challenge

Fig. 1

Distribution of molecular properties of the 24 compounds from the SAMPL6 \(\text {p}K_{\text{a}}\) Challenge. a Histogram of spectrophotometric \(\text {p}K_{\text{a}}\) measurements collected with Sirius T3 [8]. The overlaid rug plot indicates the actual values. Five compounds have multiple measured \(\text {p}K_{\text{a}}\)s in the range of 2–12. b Histogram of molecular weights calculated for the neutral state of the compounds in the SAMPL6 set. Molecular weights were calculated by neglecting counterions. c Histogram of the number of non-terminal rotatable bonds in each molecule. d Histogram of the ratio of heteroatom count (non-carbon heavy atoms, including O, N, F, S, Cl, Br, and I) to the number of carbon atoms.

The SAMPL6 \(\text {p}K_{\text{a}}\) Challenge was conducted as a blind prediction challenge and focused on predicting aqueous \(\text {p}K_{\text{a}}\) values of 24 small molecules not previously reported in the literature. The challenge set was composed of molecules that resemble fragments of kinase inhibitors. Heterocycles that are frequently found in FDA-approved kinase inhibitors were represented in this set. The compound selection process was described in depth in the prior publication reporting SAMPL6 \(\text {p}K_{\text{a}}\) Challenge experimental data collection [8]. The distribution of molecular weights, experimental \(\text {p}K_{\text{a}}\) values, number of rotatable bonds, and heteroatom to carbon ratio are depicted in Fig. 1. The challenge molecule set was composed of 17 small molecules with limited flexibility (fewer than 5 non-terminal rotatable bonds) and 7 molecules with 5–10 non-terminal rotatable bonds. The distribution of experimental \(\text {p}K_{\text{a}}\) values was roughly uniform between 2 and 12. 2D representations of all compounds are provided in Fig. 5. Drug-like molecules are often larger and more complex than the ones used in this study. We limited the size and the number of rotatable bonds of compounds to create a molecule set of intermediate difficulty.

The dataset composition and experimental details—without the identity of the small molecules—were announced approximately one month before the challenge start date. Experimental macroscopic \(\text {p}K_{\text{a}}\) measurements were collected using a spectrophotometric method with the Sirius T3 (Sirius Analytical), at room temperature, in ionic strength-adjusted water with 0.15 M KCl [8]. The instructions for participation and the identity of the challenge molecules were released on the challenge start date (October 25, 2017). A table of molecule IDs (in the form of SM##) and their canonical isomeric SMILES, defining individual protonation and tautomer states, was provided as input. Blind prediction submissions were accepted until January 22, 2018.

Following the conclusion of the blind challenge, the experimental data was made public on January 23, 2018. The SAMPL organizers and participants gathered at the Second Joint D3R/SAMPL Workshop at UC San Diego, La Jolla, CA on February 22–23, 2018 to share results. The workshop aimed to create an opportunity for participants to discuss the results, evaluate methodological choices by comparing the performance of different methods, and share lessons learned from the challenge. Participants reported their results and their own evaluations in a special issue of the Journal of Computer-Aided Molecular Design [29].

While designing this first \(\text {p}K_{\text{a}}\) prediction challenge, we did not know the optimal format to capture \(\text {p}K_{\text{a}}\) predictions of participants. We wanted to capture all necessary information needed to evaluate the submitted \(\text {p}K_{\text{a}}\) predictions. Our strategy was to directly evaluate macroscopic \(\text {p}K_{\text{a}}\) predictions comparing them to experimental macroscopic \(\text {p}K_{\text{a}}\) values and to use collected microscopic \(\text {p}K_{\text{a}}\) prediction data for more in-depth diagnostics of method performance. Therefore, we asked participants to submit their predictions in three different submission types:

  • Type I: microscopic \(\text {p}K_{\text{a}}\) values and related microstate pairs

  • Type II: fractional microstate populations as a function of pH in 0.1 pH increments

  • Type III: macroscopic \(\text {p}K_{\text{a}}\) values

For each submission type, a machine-readable submission file template was specified. For type I submissions, participants were asked to report the microstate ID of the protonated state, the microstate ID of the deprotonated state, the microscopic \(\text {p}K_{\text{a}}\), and the predicted microscopic \(\text {p}K_{\text{a}}\) standard error of the mean (SEM). The method of microstate enumeration and why it was needed are discussed further in Sect. 2.2 “Enumeration of Microstates”. The SEM aims to capture the statistical uncertainty of the prediction method. Microstate IDs were preassigned identifiers for each microstate in the form of SM##_micro###. For type II submissions, the submission format included a table that started with a microstate ID column, followed by a set of columns reporting the natural logarithm of the fractional population of each predicted microstate at 0.1 pH increments between pH 2 and 12. For type III submissions, participants were asked to report the molecule ID, macroscopic \(\text {p}K_{\text{a}}\), and macroscopic \(\text {p}K_{\text{a}}\) SEM.

We required participants to submit predictions for all fields for each prediction, but it was not mandatory to submit predictions for all the molecules or all three submission types. Although we accepted submissions with partial sets of molecules, it would have been a better choice to require predictions for all the molecules for a better comparison of overall method performance. The submission files also included fields for naming the method, listing the software utilized, and a free text section to describe the methodology used in detail.

Participants were allowed to submit predictions for multiple methods as long as they created separate submission files. While anonymous participation was allowed, all participants opted to make their submissions public. Blind submissions were assigned a unique 5-character alphanumeric submission ID, which will be used throughout this paper. Unique IDs were also assigned when multiple submissions existed for different submission types of the same method, such as microscopic \(\text {p}K_{\text{a}}\) (type I) and macroscopic \(\text {p}K_{\text{a}}\) (type III). These submission IDs were also reported in the evaluation papers of participants to allow cross-referencing. Submission IDs, participant-provided method names, and method categories are presented in Table 1. In many cases, multiple types of submissions (type I, II, and III) of the same method were provided by participants as challenge instructions requested. Although each prediction set was assigned a separate submission ID, we matched the submissions that originated from the same method according to the reports of the participants for cases where multiple sets of predictions came from a given method. Submission IDs for both macroscopic (type III) and microscopic (type I) \(\text {p}K_{\text{a}}\) predictions for each method are shown in Table 1.

Enumeration of microstates

To capture both the \(\text {p}K_{\text{a}}\) value and titrating proton position for microscopic \(\text {p}K_{\text{a}}\) predictions, we needed microscopic \(\text {p}K_{\text{a}}\) values to be reported together with a pair of microstates which describe the protonated and deprotonated states corresponding to each microscopic transition. Microstates can be written as string representations such as canonical SMILES with explicit hydrogens. However, canonical SMILES written by different software and algorithms can be interpreted inconsistently. To avoid complications while reading microstate structure files from different sources, we decided that the safest route was pre-enumerating all possible microstates of challenge compounds, assigning microstate IDs to each in the form of SM##_micro###, and requiring participants to report microscopic \(\text {p}K_{\text{a}}\) values along with microstate pairs specified by the provided microstate IDs.

We created initial sets of microstates with Schrödinger Epik [30] and OpenEye QUACPAC [31] and took the union of results. Microstates with Epik were generated using Schrödinger Suite v2016-4, running Epik to enumerate all tautomers within 20 \(\text {p}K_{\text{a}}\) units of pH 7. For enumerating microstates with OpenEye QUACPAC, we had to first enumerate formal charges and for each charge enumerate all possible tautomers using the settings of maximum tautomer count 200, level 5, with carbonyl hybridization set to False. Then we created a union of all enumerated states written as canonical isomeric SMILES generated by OpenEye OEChem [32]. Even though resonance structures correspond to different canonical isomeric SMILES, they are not different microstates, therefore it was necessary to remove resonance structures that were replicates of the same tautomer. To detect equivalent resonance structures, we converted canonical isomeric SMILES to InChI hashes with explicit and fixed hydrogen layer. Structures that describe the same tautomer but different resonance states lead to explicit hydrogen InChI hashes that are identical, allowing replicates to be removed. The Jupyter Notebook used for the enumeration of microstates is provided in Supplementary Information.

We provided microstate ID tables with canonical SMILES and 2D depictions to aid participants in matching predicted structures to microstate IDs. A canonical SMILES representation was selected over canonical isomeric SMILES, because resonance and geometric isomerism do not lead to different microstates according to our working microstate definition. The only exception was for molecule SM20, which should be consistently modeled as the E-isomer.

Despite combining enumerated charge states and tautomers generated by both Epik and OpenEye QUACPAC, to our surprise, the microstate lists were still incomplete. During the course of the SAMPL6 Challenge, participants identified new microstates that were not present in the initial list that we provided. Based on participant requests for new microstates, we iteratively had to update the list of microstates and assign new microstate IDs. Every time we received a request, we shared the updated microstate ID lists with all challenge participants. Some participants updated their \(\text {p}K_{\text{a}}\) prediction by including the newly added microstates in their calculations. In the future, developing a better algorithm that can enumerate all possible microstates (not just the ones with significant populations) would be very beneficial for anticipating microstates that may be predicted by \(\text {p}K_{\text{a}}\) prediction methods.

A microscopic \(\text {p}K_{\text{a}}\) definition was provided in challenge instructions for clarity as follows: Physically meaningful microscopic \(\text {p}K_{\text{a}}\)s are defined between microstate pairs that can interconvert by a single protonation/deprotonation event of only one titratable group. So, microstate pairs should have a total charge (absolute) difference of 1 and only one heavy atom that differs in the number of associated hydrogens, regardless of resonance state or geometric isomerism. All geometric isomer and resonance structure pairs that have the same number of hydrogens bound to equivalent heavy atoms are grouped into the same microstate where they can influence the microscopic \(\text {p}K_{\text{a}}\). Pairs of resonance structures and geometric isomers (cis/trans, stereo) are not considered as different microstates, as long as there is no change in the number of hydrogens bound to each heavy atom. Transitions where there are shifts in the position of protons coupled to changes in the number of protons were also not considered as microscopic \(\text {p}K_{\text{a}}\) values [26]. Since we wanted participants to report only microscopic \(\text {p}K_{\text{a}}\)s that describe single deprotonation events (in contrast to transitions between microstates that are different in terms of two or more titratable protons), we also provided a pre-enumerated list of allowed microstate pairs.
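
The definition above can be expressed as a simple check on a pair of microstates. The sketch below is only an illustration: the `Microstate` class, its fields, and the assumption that both microstates list hydrogen counts over the same consistently ordered heavy atoms are hypothetical and are not part of the challenge file formats.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Microstate:
    """Hypothetical minimal description of a microstate."""
    charge: int                  # total formal charge
    h_counts: Tuple[int, ...]    # hydrogens bound to each heavy atom, same atom order


def is_single_deprotonation_pair(protonated: Microstate,
                                 deprotonated: Microstate) -> bool:
    """True if the pair differs by exactly one proton on exactly one heavy atom."""
    if len(protonated.h_counts) != len(deprotonated.h_counts):
        return False
    # Losing one proton lowers the total charge by exactly one.
    if protonated.charge - deprotonated.charge != 1:
        return False
    # Collect per-atom hydrogen count differences for atoms that change.
    changes = [p - d for p, d in zip(protonated.h_counts, deprotonated.h_counts)
               if p != d]
    # Exactly one heavy atom may change, and it must lose one hydrogen.
    # A proton shift coupled to deprotonation would give more than one change.
    return changes == [1]


if __name__ == "__main__":
    cation = Microstate(charge=1, h_counts=(1, 1, 0))
    neutral = Microstate(charge=0, h_counts=(1, 0, 0))
    shifted = Microstate(charge=0, h_counts=(0, 0, 1))
    print(is_single_deprotonation_pair(cation, neutral))
    print(is_single_deprotonation_pair(cation, shifted))
```

Because resonance structures and geometric isomers share the same per-atom hydrogen counts, a representation of this kind groups them into one microstate automatically.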

Provided microstate ID and microstate pair lists were intended to be used for reporting microstates to aid parsing of submissions. The enumerated lists of microstates were not created with the intent to guide computational predictions. This was clearly stated in the challenge instructions. However, we noticed that some participants still used the microstate lists as an input for their \(\text {p}K_{\text{a}}\) predictions as we received complaints from participants that due to our updates to microstate lists they needed to repeat their calculations. This would not have been an issue if participants used \(\text {p}K_{\text{a}}\) prediction protocols that did not rely on an external pre-enumerated list of microstates as an input. None of the participants reported this dependency in their method descriptions explicitly, so it was also not obvious how participants were using the provided states in their predictions. We could not identify which submissions used these enumerated microstate lists as input for predictions and which have followed the challenge instructions and relied only on their prediction method to generate microstates.

Evaluation approaches

Since the experimental data for the challenge was mainly composed of macroscopic \(\text {p}K_{\text{a}}\) values of both monoprotic and multiprotic compounds, evaluation of macroscopic and microscopic \(\text {p}K_{\text{a}}\) predictions was not straightforward. For a subset of 8 molecules, the dominant microstate sequence could be inferred from NMR experiments. For the rest of the molecules, the only experimental information available was the macroscopic \(\text {p}K_{\text{a}}\) value. The experimental data, in the form of macroscopic \(\text {p}K_{\text{a}}\) values, did not provide any information on which group(s) are being titrated, the microscopic \(\text {p}K_{\text{a}}\) values, the identity of the associated macrostates (which total charge), or microstates (which tautomers). Also, experimental data did not provide any information about the charge state of protonated and deprotonated species associated with each macroscopic \(\text {p}K_{\text{a}}\). Typically charges of states associated with experimental \(\text {p}K_{\text{a}}\) values are assigned based on \(\text {p}K_{\text{a}}\) predictions, not experimental evidence, but we did not utilize such computational charge assignment. For a fair performance comparison between methods, we avoided relying on any particular \(\text {p}K_{\text{a}}\) prediction to assist the interpretation of the experimental reference data. This choice complicated the \(\text {p}K_{\text{a}}\) prediction analysis, especially regarding how to pair experimental and predicted \(\text {p}K_{\text{a}}\) values for error analysis. We adopted various evaluation strategies guided by the experimental data. To compare macroscopic \(\text {p}K_{\text{a}}\) predictions to experimental values, we had to utilize numerical matching algorithms before we could calculate performance statistics. For the subset of molecules with experimental data about microstates, we used microstate-based matching. These matching methods are described in more detail in the next section.

Three types of submissions were collected during the SAMPL6 \(\text {p}K_{\text{a}}\) Challenge. We have only utilized the type I (microscopic \(\text {p}K_{\text{a}}\) value and microstate IDs) and the type III (macroscopic \(\text {p}K_{\text{a}}\) value) predictions in this article. Type I submissions contained the same prediction information as the type II submissions which reported the fractional population of microstates with respect to pH. We collected type II submissions in order to capture relative populations of microstates, not realizing they were redundant. The microscopic \(\text {p}K_{\text{a}}\) predictions collected in type I submissions capture all the information necessary to calculate type II submissions. Therefore, we did not use type II submissions for challenge evaluation. In theory, type III (macroscopic \(\text {p}K_{\text{a}}\)) predictions can also be calculated from type I submissions, but collecting type III submissions allowed the participation of \(\text {p}K_{\text{a}}\) prediction methods that directly predict macroscopic \(\text {p}K_{\text{a}}\)  values without considering microspeciation and methods that apply special empirical corrections for macroscopic \(\text {p}K_{\text{a}}\) predictions.
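
To make the redundancy between type I and type II submissions concrete, the sketch below derives the natural logarithm of fractional microstate populations on the 0.1-unit pH grid from a list of microscopic \(\text {p}K_{\text{a}}\) values. It is an illustration under assumptions made here, not the challenge analysis code: the function name and the tuple layout are hypothetical, the microstate network is assumed to be connected, and the microscopic \(\text {p}K_{\text{a}}\) values are assumed to be thermodynamically consistent around any cycle.

```python
import math
from collections import deque


def ln_fractional_populations(micro_pkas, ph):
    """micro_pkas: list of (protonated_id, deprotonated_id, pKa) tuples.

    Returns {microstate_id: ln(fractional population)} at the given pH.
    """
    # Each edge stores log10([neighbor]/[node]); for a protonated/deprotonated
    # pair, log10([deprotonated]/[protonated]) = pH - pKa.
    neighbors = {}
    for prot, deprot, pka in micro_pkas:
        neighbors.setdefault(prot, []).append((deprot, ph - pka))
        neighbors.setdefault(deprot, []).append((prot, pka - ph))

    # Breadth-first traversal from an arbitrary reference microstate.
    # Only microstates connected to the reference are reached.
    reference = next(iter(neighbors))
    log10_relative = {reference: 0.0}
    queue = deque([reference])
    while queue:
        node = queue.popleft()
        for neighbor, step in neighbors[node]:
            if neighbor not in log10_relative:
                log10_relative[neighbor] = log10_relative[node] + step
                queue.append(neighbor)

    # Normalize with a log-sum-exp so that the populations sum to one.
    ln_relative = {k: v * math.log(10.0) for k, v in log10_relative.items()}
    largest = max(ln_relative.values())
    ln_total = largest + math.log(sum(math.exp(v - largest)
                                      for v in ln_relative.values()))
    return {k: v - ln_total for k, v in ln_relative.items()}


# pH grid from 2 to 12 in 0.1 increments (101 points).
ph_grid = [round(2.0 + 0.1 * k, 1) for k in range(101)]

if __name__ == "__main__":
    # Hypothetical three-state chain with microscopic pKa values of 4 and 8.
    pairs = [("state_A", "state_B", 4.0), ("state_B", "state_C", 8.0)]
    table = {ph: ln_fractional_populations(pairs, ph) for ph in ph_grid}
    print(table[6.0])
```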

Matching algorithms for pairing predicted and experimental \(\text {p}K_{\text{a}}\) values

Macroscopic \(\text {p}K_{\text{a}}\) predictions can be calculated from microscopic \(\text {p}K_{\text{a}}\) values for direct comparison to experimental macroscopic \(\text {p}K_{\text{a}}\) values. One major question must be answered to allow this comparison: How should we match predicted macroscopic \(\text {p}K_{\text{a}}\) values to experimental macroscopic \(\text {p}K_{\text{a}}\) values when there could be multiple \(\text {p}K_{\text{a}}\) values reported for a given molecule? For example, experiments on SM18 showed three macroscopic \(\text {p}K_{\text{a}}\)s, but the xvxzd method predicted two macroscopic \(\text {p}K_{\text{a}}\) values. There were also examples of the opposite situation, with more predicted \(\text {p}K_{\text{a}}\) values than experimentally determined macroscopic \(\text {p}K_{\text{a}}\)s: One experimental \(\text {p}K_{\text{a}}\) was measured for SM02, but two macroscopic \(\text {p}K_{\text{a}}\) values were predicted by the xvxzd method. The experimental and predicted values must be paired before any prediction error can be calculated, even though there was no experimental information regarding the underlying tautomer and charge states.

Knowing the charges of macrostates would have guided the pairing between experimental and predicted macroscopic \(\text {p}K_{\text{a}}\) values. However, not all experimental \(\text {p}K_{\text{a}}\) measurements can determine the charge of the protonation states being titrated. Potentiometric \(\text {p}K_{\text{a}}\) measurements capture only the relative charge change between macrostates, not the absolute value of the charge. Thus, our experimental data did not provide any information that would indicate the titration site, the overall charge, or the tautomer composition of macrostate pairs that are associated with each measured macroscopic \(\text {p}K_{\text{a}}\) that could guide the matching between predicted and experimental \(\text {p}K_{\text{a}}\) values.

For evaluating macroscopic \(\text {p}K_{\text{a}}\) predictions taking the experimental data as reference, Fraczkiewicz [23] delineated recommendations for fair comparative analysis of computational \(\text {p}K_{\text{a}}\) predictions. They recommended that, in the absence of any experimental information that would aid in matching, experimental and computational \(\text {p}K_{\text{a}}\)  values should be matched preserving the order of \(\text {p}K_{\text{a}}\) values and minimizing the sum of absolute errors.

We picked the Hungarian matching algorithm [33, 34] to match experimental and predicted macroscopic \(\text {p}K_{\text{a}}\) values with a squared error cost function as suggested by Kiril Lanevskij via personal communication. The algorithm is available in the SciPy package (scipy.optimize.linear_sum_assignment) [35]. This matching algorithm provides the optimum global assignment that minimizes the linear sum of squared errors of all pairwise matches. We selected the squared error cost function instead of the absolute error cost function to avoid misordered matches. For instance, for a molecule with experimental \(\text {p}K_{\text{a}}\) values of 4 and 6, and predicted \(\text {p}K_{\text{a}}\) values of 7 and 9, Hungarian matching with absolute error cost function could match 6 to 7 and 4 to 9, because both possible pairings have the same total absolute error of 6. Hungarian matching with squared error cost would match 4 to 7 and 6 to 9 (total squared error of 18 rather than 26), preserving the increasing \(\text {p}K_{\text{a}}\) value order between experimental and predicted values. A weakness of this approach would be failing to match the experimental value of 6 to the predicted value of 7 if that was the correct match based on underlying macrostates. But the underlying pairs of states were unknown to us, both because the experimental data did not determine which charge states the transitions were happening between and because we did not collect the pair of macrostates associated with each \(\text {p}K_{\text{a}}\) prediction in submissions. Requiring this information for macroscopic \(\text {p}K_{\text{a}}\) predictions in future SAMPL challenges would allow for better comparison between predictions, even if experimental assignment of charges is not possible. There is no perfect solution to the numerical \(\text {p}K_{\text{a}}\) assignment problem, but we tried to determine the fairest way to penalize predictions based on their numerical deviation from the experimental values.
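The matching step can be sketched with the SciPy routine named above. This is a minimal illustration rather than the challenge analysis script: the function name match_pkas and its cost argument are hypothetical, and the input values are the illustrative ones from the example (experimental 4 and 6, predicted 7 and 9).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_pkas(experimental, predicted, cost="squared"):
    """Pair experimental and predicted pKa values by minimum total cost.

    Hypothetical helper for illustration; `cost` is "squared" or "absolute".
    """
    exp = np.asarray(experimental, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    diff = exp[:, None] - pred[None, :]  # all pairwise differences
    cost_matrix = diff**2 if cost == "squared" else np.abs(diff)
    # linear_sum_assignment accepts rectangular matrices, so unequal
    # numbers of experimental and predicted values are handled.
    rows, cols = linear_sum_assignment(cost_matrix)
    return [(float(exp[i]), float(pred[j])) for i, j in zip(rows, cols)]


# Squared error cost keeps the pairs in increasing pKa order.
print(match_pkas([4.0, 6.0], [7.0, 9.0]))
```

With the absolute error cost the two possible pairings in this example tie, so which one is returned depends on the solver; the squared error cost removes that ambiguity. Values left out of the returned pairs are the unmatched ones.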

For the analysis of microscopic \(\text {p}K_{\text{a}}\) predictions we adopted a different matching approach. For the eight molecules for which we had the requisite data for this analysis, we utilized the dominant microstate sequence inferred from NMR experiments to match computational predictions and experimental \(\text {p}K_{\text{a}}\)  values. We will refer to this assignment method as microstate matching, where the experimental \(\text {p}K_{\text{a}}\) value is matched to the computational microscopic \(\text {p}K_{\text{a}}\) value which was reported for the dominant microstate pair observed for each transition. We have compared the results of Hungarian matching and microstate matching.

Inevitably, the choice of matching algorithms to assign experimental and predicted values has an impact on the computed performance statistics. We believe the Hungarian algorithm for numerical matching of unassigned \(\text {p}K_{\text{a}}\) values and microstate-based matching when experimental microstates are known were the best choices, providing the most unbiased matching without introducing assumptions outside of the experimental data.

Statistical metrics for submission performance

A variety of accuracy and correlation statistics were considered for analyzing and comparing the performance of prediction methods submitted to the SAMPL6 \(\text {p}K_{\text{a}}\) Challenge. Calculated performance statistics of predictions were provided to participants before the workshop. Details of the analysis and scripts are maintained on the SAMPL6 GitHub Repository (described in Sect. 5).

Error metrics

There are six error metrics reported for the numerical error of the \(\text {p}K_{\text{a}}\) values: the root-mean-squared error (RMSE), mean absolute error (MAE), mean error (ME), coefficient of determination (R2), linear regression slope (m), and Kendall’s Rank Correlation Coefficient (\(\tau \)). Uncertainty in each performance statistic was calculated as 95% confidence intervals estimated by non-parametric bootstrapping (sampling with replacement) over predictions with 10,000 bootstrap samples. Calculated error statistics of all methods can be found in Table S2 for macroscopic \(\text {p}K_{\text{a}}\) predictions and Table S4 for microscopic \(\text {p}K_{\text{a}}\) predictions.
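As a rough sketch of how the accuracy metrics and their bootstrap intervals could be computed, the snippet below uses only the Python standard library. The percentile construction of the interval, the function names, and the example errors are assumptions for illustration; the text specifies only non-parametric bootstrapping with 10,000 samples.

```python
import math
import random


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def mean_error(errors):
    return sum(errors) / len(errors)


def bootstrap_ci(errors, statistic, n_samples=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval (an assumed construction) for a statistic."""
    rng = random.Random(seed)
    n = len(errors)
    # Resample the errors with replacement and recompute the statistic.
    stats = sorted(
        statistic([errors[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_samples)
    )
    lo_idx = int((1 - level) / 2 * n_samples)
    hi_idx = int((1 + level) / 2 * n_samples) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical signed errors (predicted minus experimental), in pKa units.
example_errors = [0.3, -0.8, 1.2, -0.1, 0.5, -1.6]
print(rmse(example_errors), bootstrap_ci(example_errors, rmse))
```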

Assessing macrostate predictions

In addition to assessing the numerical error in predicted \(\text {p}K_{\text{a}}\) values, we also evaluated predictions in terms of their ability to capture the correct macrostates (ionization states) and microstates (tautomers of each ionization state) to the extent possible from the available experimental data. For macroscopic \(\text {p}K_{\text{a}}\)s, the spectrophotometric experiments do not directly report on the identity of the ionization states. However, the number of macroscopic \(\text {p}K_{\text{a}}\)s observed within the experimental range of 2.0–12.0 indicates the number of ionization states in that range. For instance, SM14 has two experimental \(\text {p}K_{\text{a}}\)s and therefore three different charge states observed between pH 2.0 and 12.0. If a prediction reported 4 macroscopic \(\text {p}K_{\text{a}}\)s, it is clear that this method predicted extra ionization states. With this perspective, we reported the number of unmatched experimental \(\text {p}K_{\text{a}}\)s (the number of missing \(\text {p}K_{\text{a}}\) predictions, i.e., missing ionization states) and the number of unmatched predicted \(\text {p}K_{\text{a}}\)s (the number of extra \(\text {p}K_{\text{a}}\) predictions, i.e., extra ionization states) after Hungarian matching. The latter count was restricted to predictions with \(\text {p}K_{\text{a}}\) values between 2 and 12 because that was the range of the experimental method. Extra or missing \(\text {p}K_{\text{a}}\) predictions highlight a failure to predict the correct number of ionization states within a pH range.
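The counting of missing and extra predictions for a single molecule can be sketched as follows. The function name, its arguments, and the inclusive treatment of the range boundaries are assumptions made for illustration, and the example values are hypothetical.

```python
def count_unmatched(n_experimental, predicted, matched_pred_indices,
                    ph_min=2.0, ph_max=12.0):
    """Return (missing, extra) macroscopic pKa counts for one molecule.

    missing: experimental pKas left without a predicted partner.
    extra:   unmatched predicted pKas that fall inside the assay range.
    """
    matched = set(matched_pred_indices)
    missing = n_experimental - len(matched)
    extra = sum(
        1
        for i, pka in enumerate(predicted)
        if i not in matched and ph_min <= pka <= ph_max
    )
    return missing, extra


# Hypothetical molecule: two experimental pKas, four predicted values,
# of which the predictions at indices 1 and 2 were matched.
print(count_unmatched(2, [1.2, 4.1, 6.9, 10.5], {1, 2}))
```

In this example the unmatched prediction at 1.2 lies outside the assay range and is not counted, so only the prediction at 10.5 contributes to the extra count.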

Assessing microstate predictions

For the evaluation of microscopic \(\text {p}K_{\text{a}}\) predictions, taking advantage of the available dominant microstate sequence data for a subset of 8 compounds, we calculated the dominant microstate prediction accuracy, which is the number of correct dominant tautomer predictions for each charge state divided by the total number of dominant tautomer predictions. Dominant microstate prediction accuracy was calculated over all experimentally detected ionization states of each molecule which were part of this analysis. In order to extract the sequence of dominant microstates from the microscopic \(\text {p}K_{\text{a}}\) prediction sets, we calculated the relative free energy of microstates, selecting a neutral tautomer and pH 0 as reference, following Eq. 8. Calculation of relative microstate free energies was explained in more detail in a previous publication [26].

The relative free energy of a state A with respect to reference state B at pH 0.0 (arbitrary pH value selected as reference) can be calculated as follows:

$$\begin{aligned} \Delta G_{AB} = \Delta m_{AB} \,RT\ln {10}\,(pH - pK_{a}) \end{aligned}$$
(8)

\(\Delta m_{AB}\) is equal to the number of protons in state A minus that in state B. R and T indicate the molar gas constant and temperature, respectively. By calculating relative free energies of all predicted microstates with respect to the same reference state and pH, we were able to determine the sequence of predicted dominant microstates. The dominant tautomer of each charge state was determined as the microstate with the lowest free energy in the subset of predicted microstates of each ionization state. This approach is feasible because the relative free energy of tautomers of the same ionization state is independent of pH and therefore the choice of reference pH is arbitrary.
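Eq. 8 and the selection of the dominant tautomer can be illustrated with a short sketch. The temperature of 298.15 K, the kcal/mol units, the function names, and the microstate IDs are assumptions for illustration; the second function also assumes that relative free energies of all microstates have already been assembled with respect to one reference state.

```python
import math

R_KCAL = 1.987204e-3  # molar gas constant in kcal/(mol K)


def relative_free_energy(delta_m, pka, ph=0.0, temperature=298.15):
    """Eq. 8: free energy of state A relative to state B, in kcal/mol."""
    return delta_m * R_KCAL * temperature * math.log(10) * (ph - pka)


def dominant_microstates(free_energies, charges):
    """Pick the lowest-free-energy microstate ID within each charge state."""
    best = {}
    for state, g in free_energies.items():
        q = charges[state]
        if q not in best or g < free_energies[best[q]]:
            best[q] = state
    return best


# State A carries one more proton than reference B, with pKa 5 (hypothetical).
print(relative_free_energy(delta_m=1, pka=5.0))

# Hypothetical microstate IDs with relative free energies (kcal/mol) and charges.
energies = {"ms1": 0.0, "ms2": 1.2, "ms3": -3.0, "ms4": -2.1}
charges = {"ms1": 0, "ms2": 0, "ms3": 1, "ms4": 1}
print(dominant_microstates(energies, charges))
```

At the reference pH of 0 the protonated state of a group with a positive \(\text {p}K_{\text{a}}\) has a negative relative free energy, consistent with it being favored under strongly acidic conditions.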

Identifying consistently top-performing methods

We created a shortlist of top-performing methods for macroscopic and microscopic \(\text {p}K_{\text{a}}\) predictions. The top macroscopic \(\text {p}K_{\text{a}}\) predictions were selected if they ranked in the top 10 consistently according to two error metrics (RMSE, MAE) and two correlation metrics (R-Squared, and Kendall’s Tau), while also having fewer than eight missing or extra macroscopic \(\text {p}K_{\text{a}}\)s for the entire molecule set (eight macrostate errors correspond to macrostate prediction mistake in roughly one third of the 24 compounds). These methods are presented in Table 2. A separate list of top-performing methods was constructed for microscopic \(\text {p}K_{\text{a}}\) with the following criteria: ranking in the top 10 methods when ranked by accuracy statistics (RMSE and MAE) and perfect dominant microstate prediction accuracy. These methods are presented in Table 3.

Determining challenging molecules

In addition to comparing the performance of methods, we also wanted to compare \(\text {p}K_{\text{a}}\) prediction performance for each molecule to determine which molecules were the most challenging for \(\text {p}K_{\text{a}}\) predictions considering all the methods in the challenge. For this purpose, we plotted prediction error distributions of each molecule calculated over all prediction methods. We also calculated MAE for each molecule over all prediction sets as well as for predictions from each method category separately.

Reference calculations

Including a null model is helpful in comparative performance analysis of predictive methods to establish what the performance statistics look like for a baseline method for the specific dataset. Null models or null predictions employ a simple prediction model which is not expected to be particularly successful, but it provides a simple point of comparison for more sophisticated methods. The expectation or goal is for more sophisticated or costly prediction methods to outperform the predictions from a null model; otherwise the simpler null model would be preferable. In the SAMPL6 \(\text {p}K_{\text{a}}\) Challenge there were two blind submissions using database lookup methods that were submitted to serve as null predictions. These methods, with submission IDs 5nm4j and pwn3m, both used the OpenEye pKa-Prospector database to find the most similar molecule to the query molecule and simply reported its \(\text {p}K_{\text{a}}\) as the predicted value. Database lookup methods with a rich experimental database do present a challenging null model to beat. However, due to the accuracy level needed from \(\text {p}K_{\text{a}}\) predictions for computer-aided drug design, we believe such methods provide an appropriate performance baseline that physical and empirical \(\text {p}K_{\text{a}}\) prediction methods should strive to outperform.

We also included additional reference calculations in the comparative analysis to provide more perspective. Some methods widely used in academia and industry were missing from the blind challenge submissions. Therefore, we included those methods as reference calculations: Schrödinger/Epik (nb007, nb008, nb010), Schrödinger/Jaguar (nb011, nb013), ChemAxon/Chemicalize (nb015), and Molecular Discovery/MoKa (nb016, nb017). Epik and Jaguar \(\text {p}K_{\text{a}}\) predictions were collected by Bas Rustenburg, Chemicalize predictions by Mehtap Isik, and MoKa predictions by Thomas Fox. All were done after the challenge deadline, avoiding any alterations to their respective standard procedures and any guidance from experimental data. Experimental data was publicly available before these calculations were complete; therefore, reference calculations were not formally considered as blind submissions.

All figures and statistics tables in this manuscript include reference calculations. As the reference calculations were not formal submissions, these were omitted from formal ranking in the challenge, but we present plots in this article which show them for easy comparison. These are labeled with submission IDs of the form nb### to clearly indicate non-blind reference calculations.

Results and discussion

Participation in the SAMPL6 \(\text {p}K_{\text{a}}\) Challenge was high with 11 research groups contributing \(\text {p}K_{\text{a}}\) prediction sets for 37 methods. A large variety of \(\text {p}K_{\text{a}}\) prediction methods were represented in the SAMPL6 Challenge. We categorized these submissions into four method classes: database lookup (DL), linear free energy relationship (LFER), quantitative structure-property relationship or machine learning (QSPR/ML), and quantum mechanics (QM). Quantum mechanics models were subcategorized into QM methods with and without linear empirical correction (LEC), and combined quantum mechanics and molecular mechanics (QM + MM). Table 1 presents method names, submission IDs, method categories, and also references for each approach. Integral equation-based approaches (e.g., EC-RISM) were also evaluated under the Physical (QM) category. There were 2 DL, 4 LFER, and 5 QSPR/ML methods represented in the challenge, including the reference calculations. The majority of QM calculations include linear empirical corrections (22 methods in the QM + LEC category), and only 5 QM methods were submitted without any empirical corrections. There were 4 methods that used a mixed physical modeling approach of QM + MM.

The following sections present a detailed performance evaluation of blind submissions and reference prediction methods for macroscopic and microscopic \(\text {p}K_{\text{a}}\) predictions. Performance statistics of all the methods can be found in Tables S2 and S4. Methods are referred to by their submission IDs, which are provided in Table 1.

Table 1 Submission IDs, names, category, and type for all the \(\text {p}K_{\text{a}}\) prediction sets

Analysis of macroscopic \(\text {p}K_{\text{a}}\) predictions

Fig. 2

RMSE and unmatched \(\text {p}K_{\text{a}}\) counts vs. submission ID plots for macroscopic \(\text {p}K_{\text{a}}\) predictions based on Hungarian matching. Methods are indicated by submission IDs. RMSE is shown with error bars denoting 95% confidence intervals obtained by bootstrapping over challenge molecules. Submissions are colored by their method categories. Light blue colored database lookup methods are utilized as the null prediction method. QM methods category (navy) includes pure QM, QM+LEC, and QM+MM approaches. Lower bar plots show the number of unmatched experimental \(\text {p}K_{\text{a}}\)  values (light grey, missing predictions) and the number of unmatched \(\text {p}K_{\text{a}}\) predictions (dark grey, extra predictions) for each method between pH 2 and 12. Submission IDs are summarized in Table 1. Submission IDs of the form nb### refer to non-blinded reference methods computed after the blind challenge submission deadline. All others refer to blind, prospective predictions

The performance of macroscopic \(\text {p}K_{\text{a}}\) predictions was analyzed by comparison to experimental \(\text {p}K_{\text{a}}\) values collected by the spectrophotometric method via numerical matching following the Hungarian method. Overall \(\text {p}K_{\text{a}}\) prediction performance was worse than we hoped. Figure 2 shows RMSE calculated for each prediction method represented by their submission IDs. Other performance statistics are depicted in Fig. 3. In both figures, method categories are indicated by the color of the error bars. The statistics depicted in these figures can be found in Table S2. Prediction error ranged from 0.7 to 3.2 \(\text {p}K_{\text{a}}\) units in terms of RMSE, while an RMSE between 2 and 3 log units was observed for the majority of methods (20 out of 38 methods). Only five methods achieved RMSE less than 1 \(\text {p}K_{\text{a}}\) unit. One is a QM method with the COSMO-RS approach for solvation and linear empirical correction (xvxzd (DSD-BLYP-D3(BJ)/def2-TZVPD//PBEh-3c[DCOSMO-RS] + RRHO(GFN-xTB[GBSA]) + Gsolv(COSMO-RS[TZVPD]) and linear fit)), and the remaining four are empirical prediction methods of LFER (xmyhm (ACD/pKa Classic), nb007 (Schrödinger/Epik Scan)) and QSPR/ML categories (gyuhx (Simulations Plus), nb017 (MoKa)). These five methods with RMSE less than 1 \(\text {p}K_{\text{a}}\) unit are also the methods that have the lowest MAE. xmyhm and xvxzd were the only two methods for which the upper bound of the 95% confidence interval of RMSE was lower than 1 \(\text {p}K_{\text{a}}\) unit.

In terms of correlation statistics, many methods have good performance, although the ranking of methods changes according to R2 and Kendall’s Tau. In addition, many methods are indistinguishable from one another, considering the uncertainty of the correlation statistics. 32 out of 38 methods have R2 and Kendall’s Tau higher than 0.7 and 0.6, respectively. 8 methods have R2 higher than 0.9 and 6 methods have Kendall’s Tau higher than 0.8. The overlap of these two sets is the following: gyuhx (Simulations Plus), xvxzd (DSD-BLYP-D3(BJ)/def2-TZVPD//PBEh-3c[DCOSMO-RS] + RRHO(GFN-xTB[GBSA]) + Gsolv(COSMO-RS[TZVPD]) and linear fit), xmyhm (ACD/pKa Classic), ryzue (Adiabatic scheme with single point correction: SMD/M06-2X//6-311++G(d,p)//M06-2X/6-31+G(d) for bases and SMD/M06-2X//6-311++G(d,p)//M06-2X/6-31G(d) for acids + thermal corrections), and 5byn6 (Adiabatic scheme: thermodynamic cycle that uses gas phase optimized structures for gas phase free energy and solution phase geometries for solvent phase free energy. SMD/M06-2X/6-31+G(d) for bases and SMD/M06-2X/6-31G(d) for acids + thermal corrections). It is worth noting that ryzue and 5byn6 are QM predictions without any empirical correction. Their high correlation and rank correlation coefficient scores signal that with an empirical correction their accuracy-based performance could improve. Indeed, the participants have shown that this is the case in their own challenge analysis paper and achieved an RMSE of 0.73 \(\text {p}K_{\text{a}}\) units after the challenge [41].

Null prediction methods based on database lookup (5nm4j and pwn3m) had similar performance, with an RMSE of roughly 2.5 \(\text {p}K_{\text{a}}\) units, an MAE of 1.5 \(\text {p}K_{\text{a}}\) units, R2 of 0.2, and Kendall’s Tau of 0.3. Many methods were observed to have a prediction performance advantage over the null predictions shown in light blue in Figs. 2 and 3 considering all the performance metrics as a whole. In terms of correlation statistics, the null methods are the worst performers, except for 0hxtm. From the perspective of accuracy-based statistics (RMSE and MAE), only the top 10 methods were observed to have significantly lower errors than the null methods considering the uncertainty of error metrics expressed as 95% confidence intervals.

The distribution of macroscopic \(\text {p}K_{\text{a}}\) prediction signed errors observed in each submission was plotted in Fig. 7A as ridge plots using the Hungarian matching scheme. 2ii2g, f0gew, np64b, p0jba, and yc70m tended to overestimate, while 5byn6, ryzue, and w4iyd tended to underestimate macroscopic \(\text {p}K_{\text{a}}\) values.

Four submissions in the QM+LEC category used the COSMO-RS implicit solvation model. While three of these achieved the lowest RMSE among QM-based methods (xvxzd, yqkga, and 8xt50) [46], one of them showed the highest RMSE (0hxtm (COSMOtherm_FINE17)) among all SAMPL6 Challenge macroscopic \(\text {p}K_{\text{a}}\) predictions. All four methods used COSMO-RS/FINE17 to compute solvation free energies. The major difference between the three low-RMSE methods and 0hxtm seems to be the protocol for determining relevant conformations for each microstate. xvxzd, yqkga, and 8xt50 used a semi-empirical tight binding (GFN-xTB) method and GBSA continuum solvation model for geometry optimization, followed by high level single-point energy calculations with a solvation free energy correction (COSMO-RS(FINE17/TZVPD)) and rigid rotor harmonic oscillator (RRHO[GFN-xTB(GBSA)]) correction. yqkga and 8xt50 selected conformations for each microstate with the Relevant Solution Conformer Sampling and Selection (ReSCoSS) workflow [46]. The conformations were clustered according to shape, and the lowest energy conformations from each cluster (according to BP86/TZVP/COSMO single point energies in any of the 10 different COSMO-RS solvents) were considered as relevant conformers. The yqkga method further filtered out conformers that have less than 5% Boltzmann weights at the DSD-BLYP-D3/def2-TZVPD + RRHO(GFNxTB) + COSMO-RS(fine) level. The xvxzd method used an MF-MD-GC//GFN-xTB workflow and energy thresholds of 6 kcal/mol and 10 kcal/mol, for conformer and microstate selection. On the other hand, the conformational ensemble captured for each microstate seems to be more limited for the 0hxtm method, judging by the method description provided in the submission file (this participant did not publish an analysis of the results that they obtained for SAMPL6). The 0hxtm method reported that relevant conformations were computed with the COSMOconf 4.2 workflow, which produced multiple relevant conformers for only the neutral states of SM18 and SM22. In contrast to xvxzd, yqkga, and 8xt50, the 0hxtm method also did not include an RRHO correction. Participants who submitted the three low-RMSE methods report that capturing the chemical ensemble for each molecule, including conformers and tautomers, and high-level QM calculations led to more successful macroscopic \(\text {p}K_{\text{a}}\) prediction results and that the RRHO correction provided a minor improvement [46]. Comparing these results to other QM approaches in the SAMPL Challenge also points to the advantage of the COSMO-RS solvation approach compared to other implicit solvent models.

In addition to the statistics related to the \(\text {p}K_{\text{a}}\) value, we also analyzed missing or extra \(\text {p}K_{\text{a}}\) predictions. Analysis of the \(\text {p}K_{\text{a}}\) values with accuracy- and correlation-based error metrics was only possible after the matching of predicted macroscopic \(\text {p}K_{\text{a}}\) values to experimental \(\text {p}K_{\text{a}}\) values through Hungarian matching, although this approach masks \(\text {p}K_{\text{a}}\) prediction issues in the form of extra or missing macroscopic \(\text {p}K_{\text{a}}\) predictions. To capture this class of prediction errors, we reported the number of unmatched experimental \(\text {p}K_{\text{a}}\)s (missing \(\text {p}K_{\text{a}}\) predictions) and the number of unmatched predicted \(\text {p}K_{\text{a}}\)s (extra \(\text {p}K_{\text{a}}\) predictions) after Hungarian matching for each method. Both missing and extra \(\text {p}K_{\text{a}}\) prediction counts were only considered for the pH range of 2–12, which corresponds to the limits of the experimental assay. The lower subplot of Fig. 2 shows the total count of unmatched experimental or predicted \(\text {p}K_{\text{a}}\) values for all the molecules in each prediction set. The order of submission IDs on the x-axis follows the RMSE-based ranking so that the performance of each method in terms of both \(\text {p}K_{\text{a}}\) value accuracy and the number of \(\text {p}K_{\text{a}}\)s can be viewed together. The omission or inclusion of extra macroscopic \(\text {p}K_{\text{a}}\) predictions is a critical error because inaccuracy in predicting the correct number of macroscopic transitions shows that methods are failing to predict the correct set of charge states, i.e., failing to predict the correct number of ionization states that can be observed within the specified pH range.

In the analysis of these challenge results, extra macroscopic \(\text {p}K_{\text{a}}\) predictions were found to be more common than missing \(\text {p}K_{\text{a}}\) predictions. In \(\text {p}K_{\text{a}}\) prediction evaluations, the accuracy of predicted ionization states within a pH range is usually neglected. When predictions are only evaluated for the accuracy of the \(\text {p}K_{\text{a}}\) value with numerical matching algorithms, a larger number of predicted \(\text {p}K_{\text{a}}\)s leads to greater underestimation of prediction errors. Therefore, it is not surprising that methods are biased to predict extra \(\text {p}K_{\text{a}}\) values. The SAMPL6 \(\text {p}K_{\text{a}}\) Challenge experimental data consists of 31 macroscopic \(\text {p}K_{\text{a}}\)s in total, measured for 24 molecules (6 molecules in the set have multiple \(\text {p}K_{\text{a}}\)s). Within the 10 methods with the lowest RMSE, only the xvxzd method predicts too few \(\text {p}K_{\text{a}}\) values (2 unmatched out of 31 experimental \(\text {p}K_{\text{a}}\)s). All other methods that rank in the top 10 by RMSE have extra predicted \(\text {p}K_{\text{a}}\)s ranging from 1 to 13. Two prediction sets without any extra \(\text {p}K_{\text{a}}\) predictions and low RMSE are 8xt50 (ReSCoSS conformations // DSD-BLYP-D3 reranking // COSMOtherm pKa) and nb015 (ChemAxon/Chemicalize).

Fig. 3

Additional performance statistics for macroscopic \(\text {p}K_{\text{a}}\) predictions based on Hungarian matching. Methods are indicated by submission IDs. Mean absolute error (MAE), mean error (ME), Pearson’s R2, and Kendall’s Rank Correlation Coefficient Tau (\(\tau \)) are shown, with error bars denoting 95% confidence intervals obtained by bootstrapping over challenge molecules. Refer to Table 1 for the submission IDs and method names. Submissions are colored by their method categories. Light blue colored database lookup methods are utilized as the null prediction method

Consistently well-performing methods for macroscopic \(\text {p}K_{\text{a}}\) prediction

Methods ranked differently when ordered by different error metrics, although there were a couple of methods that consistently ranked in the top fraction. By using combinatorial criteria that take multiple statistical metrics and unmatched \(\text {p}K_{\text{a}}\) counts into account, we identified a shortlist of consistently well-performing methods for macroscopic \(\text {p}K_{\text{a}}\) predictions, shown in Table 2. The criteria for selection were ranking in the Top 10 according to each of RMSE, MAE, R2, and Kendall’s Tau, and also having a combined unmatched \(\text {p}K_{\text{a}}\) count (extra and missing \(\text {p}K_{\text{a}}\)s) of fewer than 8 (a third of the number of compounds). We ranked methods in ascending order for RMSE and MAE and in descending order for R2 and Kendall’s Tau. Then, we took the intersection of the Top 10 sets according to each statistic to determine the consistently well-performing methods. This resulted in a list of four methods that are consistently well-performing across all criteria.
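The selection procedure amounts to intersecting several top-N sets and then applying the unmatched-count filter. The sketch below shows one way to express this; the function name, the dictionary layout, and the metric keys are hypothetical, and the toy statistics are not challenge results.

```python
def consistently_top(stats, unmatched, top_n=10, max_unmatched=8):
    """Return method IDs in the top_n of every metric with few unmatched pKas.

    stats:     {method_id: {"RMSE": ..., "MAE": ..., "R2": ..., "tau": ...}}
    unmatched: {method_id: missing + extra pKa count}
    """
    lower_is_better = {"RMSE": True, "MAE": True, "R2": False, "tau": False}
    selected = set(stats)
    for metric, ascending in lower_is_better.items():
        # Ascending order for error metrics, descending for correlations.
        ranked = sorted(stats, key=lambda m: stats[m][metric],
                        reverse=not ascending)
        selected &= set(ranked[:top_n])
    return sorted(m for m in selected if unmatched[m] < max_unmatched)


# Toy statistics for three made-up methods, using a top-2 cutoff.
toy_stats = {
    "a": {"RMSE": 0.70, "MAE": 0.50, "R2": 0.95, "tau": 0.85},
    "b": {"RMSE": 0.80, "MAE": 0.60, "R2": 0.90, "tau": 0.80},
    "c": {"RMSE": 0.75, "MAE": 0.90, "R2": 0.50, "tau": 0.40},
}
print(consistently_top(toy_stats, {"a": 3, "b": 1, "c": 0}, top_n=2))
```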

Consistently well-performing methods for macroscopic \(\text {p}K_{\text{a}}\) prediction included methods from all categories. Two methods in the QM+LEC category were xvxzd (DSD-BLYP-D3(BJ)/def2-TZVPD//PBEh-3c[DCOSMO-RS] + RRHO(GFN-xTB[GBSA]) + Gsolv(COSMO-RS[TZVPD]) and linear fit) and 8xt50 (ReSCoSS conformations // DSD-BLYP-D3 reranking // COSMOtherm pKa), and both used COSMO-RS. Empirical \(\text {p}K_{\text{a}}\) predictions with top performance were both proprietary software. From the QSPR and LFER categories, gyuhx (Simulations Plus) and xmyhm (ACD/pKa Classic) were consistently well-performing methods. The Simulations Plus \(\text {p}K_{\text{a}}\) prediction method consisted of 10 artificial neural network ensembles trained on 16,000 compounds for 10 classes of ionizable atoms, with the ionization class of each atom determined using an assigned atom type and local molecular environment [48]. The ACD/pKa Classic method was trained on 17,000 compounds, uses Hammett-type equations, and captures effects related to tautomeric equilibria, covalent hydration, resonance effects, and \(\alpha , \beta \)-unsaturated systems [38].

Table 2 Four consistently well-performing prediction methods for macroscopic \(\text {p}K_{\text{a}}\) prediction based on consistent ranking within the Top 10 according to various statistical metrics

Figure 4 plots predicted vs. experimental macroscopic \(\text {p}K_{\text{a}}\) values for the four consistently well-performing methods, a representative average method, and the null method (5nm4j). We selected the method with the highest RMSE below the median of all methods as the representative method with average performance: 2ii2g (EC-RISM/MP2/cc-pVTZ-P2-q-noThiols-2par).

Fig. 4

Predicted vs. experimental macroscopic \(\text {p}K_{\text{a}}\) prediction for four consistently well-performing methods, a representative method with average performance (2ii2g), and the null method (5nm4j). When submissions were ranked according to RMSE, MAE, R2, and \(\tau \), four methods ranked in the Top 10 consistently in each of these metrics. Dark and light green shaded areas indicate 0.5 and 1.0 units of error. Error bars indicate standard error of the mean of predicted and experimental values. Experimental \(\text {p}K_{\text{a}}\) SEM values are too small to be seen under the data points. EC-RISM/MP2/cc-pVTZ-P2-q-noThiols-2par method (2ii2g) was selected as the representative method with average performance because it is the method with the highest RMSE below the median

Which chemical properties are driving macroscopic \(\text {p}K_{\text{a}}\) prediction failures?

In addition to comparing the performance of methods that participated in the SAMPL6 Challenge, we also wanted to analyze macroscopic \(\text {p}K_{\text{a}}\) predictions from the perspective of challenge molecules and determine whether particular compounds suffer from larger inaccuracy in \(\text {p}K_{\text{a}}\) predictions. The goal of this analysis is to provide insight on which molecular properties or moieties might be causing larger \(\text {p}K_{\text{a}}\) prediction errors. In Fig. 5, 2D depictions of the challenge molecules are presented with MAE calculated for their macroscopic \(\text {p}K_{\text{a}}\) predictions over all methods, based on Hungarian matching. For multiprotic molecules, the MAE was averaged over all the \(\text {p}K_{\text{a}}\) values. For the analysis of \(\text {p}K_{\text{a}}\) prediction accuracy observed for each molecule, MAE is a more appropriate statistical value than RMSE for following global trends, as it is less sensitive to outliers than the RMSE.
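A per-molecule MAE over all methods can be computed by grouping matched prediction and experiment pairs by molecule. The record layout, function name, and example values below are hypothetical and only illustrate the aggregation, including how several \(\text {p}K_{\text{a}}\) values of a multiprotic molecule enter the same average.

```python
from collections import defaultdict


def per_molecule_mae(records):
    """records: iterable of (molecule_id, method_id, predicted, experimental)."""
    abs_errors = defaultdict(list)
    for molecule, _method, predicted, experimental in records:
        # Every matched pKa pair of a molecule contributes one absolute error.
        abs_errors[molecule].append(abs(predicted - experimental))
    return {mol: sum(errs) / len(errs) for mol, errs in abs_errors.items()}


# Made-up matched pairs for two molecules and two methods.
example_records = [
    ("molA", "m1", 5.0, 4.0),
    ("molA", "m2", 3.5, 4.0),
    ("molB", "m1", 7.0, 7.0),
]
print(per_molecule_mae(example_records))
```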

A comparison of the prediction accuracy of individual molecules is shown in Fig. 6. In Fig. 6a, the MAE for each molecule is shown considering all blind predictions and reference calculations. A cluster of molecules marked orange and red have higher than average MAE. Molecules marked red (SM06, SM21, and SM22) are the only compounds in the SAMPL6 dataset with bromo or iodo groups and they suffered a macroscopic \(\text {p}K_{\text{a}}\) prediction error in the range of 1.7–2.0 \(\text {p}K_{\text{a}}\) units in terms of MAE. Molecules marked orange (SM03, SM10, SM18, SM19, and SM20) have sulfur-containing heterocycles, and all these molecules except SM18 have MAE larger than 1.6 \(\text {p}K_{\text{a}}\) units. Despite containing a thiazole group, SM18 has a low prediction MAE. SM18 is the only compound with three experimental \(\text {p}K_{\text{a}}\) values, and we suspect the presence of multiple experimental \(\text {p}K_{\text{a}}\) values could have a masking effect on the errors captured by the MAE when the Hungarian matching scheme is used, because more potential pairing choices may artificially lower the error.

We separately analyzed the MAE of each molecule for empirical (LFER and QSPR/ML) and QM-based physical methods (QM, QM + LEC, and QM + MM) to gain additional insight into prediction errors. Figure 6b shows that the difficulty of predicting \(\text {p}K_{\text{a}}\)  values of the same subset of molecules was a trend conserved in the performance of physical methods. For QM-based methods, sulfur-containing heterocycles, amides proximal to aromatic heterocycles, and compounds with iodo and bromo substitutions have lower \(\text {p}K_{\text{a}}\) prediction accuracy.

The SAMPL6 \(\text {p}K_{\text{a}}\) set consists of only 24 small molecules and lacks multiple examples of many moieties, limiting our ability to determine with statistical significance which chemical substructures cause greater errors in \(\text {p}K_{\text{a}}\) predictions. Still, the trends observed in this challenge point to molecules with iodo or bromo groups and molecules with sulfur-containing heterocycles as having systematically larger prediction errors in macroscopic \(\text {p}K_{\text{a}}\) value. We hope that reporting this observation will lead to the improvement of methods for compounds with such moieties.

Fig. 5

Molecules from the SAMPL6 Challenge with MAE calculated for all macroscopic \(\text {p}K_{\text{a}}\) predictions. The MAE calculated over all prediction methods indicates which molecules had the lowest prediction accuracy in the SAMPL6 Challenge. MAE values calculated for each molecule include all the matched \(\text {p}K_{\text{a}}\) values. SM06, SM14, SM15, SM16, SM18, and SM22 were multiprotic. The Hungarian matching algorithm was employed for pairing experimental and predicted \(\text {p}K_{\text{a}}\) values. MAE values are reported with 95% confidence intervals

Fig. 6

Average prediction accuracy calculated over all prediction methods was poorer for molecules with sulfur-containing heterocycles, bromo, and iodo groups. a MAE calculated for each molecule as an average of all methods. b MAE of each molecule broken out by method category. QM-based methods (blue) include QM predictions with or without linear empirical correction. Empirical methods (green) include QSAR, ML, DL, and LFER approaches. c Depiction of SAMPL6 molecules with sulfur-containing heterocycles. d Depiction of SAMPL6 molecules with iodo and bromo groups

Fig. 7

Macroscopic \(\text {p}K_{\text{a}}\) prediction error distribution plots show how prediction accuracy varies across methods and individual molecules. a \(\text {p}K_{\text{a}}\) prediction error distribution for each submission for all molecules according to Hungarian matching. b Error distribution for each SAMPL6 molecule for all prediction methods according to Hungarian matching. For multiprotic molecules, \(\text {p}K_{\text{a}}\) ID numbers (pKa1, pKa2, and pKa3) were assigned in the direction of increasing experimental \(\text {p}K_{\text{a}}\) value

We also looked for correlations with molecular descriptors to find other potential explanations for why macroscopic \(\text {p}K_{\text{a}}\) prediction errors were larger for certain molecules. When testing the correlation between errors and many molecular descriptors, it is important to account for the possibility of spurious correlations. We did not observe any statistically significant correlation between \(\text {p}K_{\text{a}}\) prediction errors and the descriptors we tested. First, having more experimental \(\text {p}K_{\text{a}}\) values (Fig. 6a) did not seem to be associated with poorer \(\text {p}K_{\text{a}}\) prediction performance. Still, we need to keep in mind that multiprotic compounds were sparsely represented in the SAMPL6 set (5 molecules with 2 macroscopic \(\text {p}K_{\text{a}}\) values and one with 3 macroscopic \(\text {p}K_{\text{a}}\) values). Second, we checked the following other descriptors: presence of an amide group, molecular weight, heavy atom count, rotatable bond count, heteroatom count, heteroatom-to-carbon ratio, ring system count, maximum ring size, and the number of microstates (as enumerated for the challenge) [49]. Correlation plots and \(R^2\) values can be seen in Fig. S2.

We had suspected that \(\text {p}K_{\text{a}}\) prediction methods may perform better for moderate values (4–10) than extreme values, as molecules with extreme \(\text {p}K_{\text{a}}\) values are less likely to change ionization states close to physiological pH. To test this, we looked at the distribution of absolute errors calculated for all molecules and challenge predictions, binned by experimental \(\text {p}K_{\text{a}}\) value in increments of 2 \(\text {p}K_{\text{a}}\) units. As can be seen in Fig. S3B, the true macroscopic \(\text {p}K_{\text{a}}\) value was not a factor affecting the prediction error seen in the SAMPL6 Challenge.

Figure 7b is helpful to answer the question “Are there molecules with consistently overestimated or underestimated \(\text {p}K_{\text{a}}\) values?”. These ridge plots show the error distribution of each experimental \(\text {p}K_{\text{a}}\). SM02_pKa1, SM04_pKa1, SM14_pKa1, and SM21_pKa1 were underestimated by more than 1 \(\text {p}K_{\text{a}}\) unit by the majority of the prediction methods, which therefore predicted lower proton affinity. SM03_pKa1, SM06_pKa2, SM19_pKa1, and SM20_pKa1 were overestimated by more than 1 \(\text {p}K_{\text{a}}\) unit by the majority of the prediction methods. SM03_pKa1, SM06_pKa2, SM10_pKa1, SM19_pKa1, and SM22_pKa1 have the highest spread of errors and were less accurately predicted overall.

Analysis of microscopic \(\text {p}K_{\text{a}}\) predictions using microstates determined by NMR for 8 molecules

The most common approach for analyzing microscopic \(\text {p}K_{\text{a}}\) prediction accuracy has been to compare it to experimental macroscopic \(\text {p}K_{\text{a}}\) data, assuming experimental \(\text {p}K_{\text{a}}\)  values describe titrations of distinguishable sites and, therefore, correspond to microscopic \(\text {p}K_{\text{a}}\)s. But this typical approach fails to evaluate methods at the microscopic level.

Analysis of microscopic \(\text {p}K_{\text{a}}\) predictions for the SAMPL6 Challenge was not straightforward due to the lack of experimental data with microscopic resolution of the titratable sites and their associated microscopic \(\text {p}K_{\text{a}}\)s. For 24 molecules, macroscopic \(\text {p}K_{\text{a}}\) values were determined with the spectrophotometric method. For 18 molecules, a single macroscopic titration was observed, and for 6 molecules multiple experimental \(\text {p}K_{\text{a}}\) values were observed and characterized. For the 18 molecules with a single experimental \(\text {p}K_{\text{a}}\), it is probable that the molecules are monoprotic and, therefore, that the macroscopic \(\text {p}K_{\text{a}}\) value is equal to the microscopic \(\text {p}K_{\text{a}}\). There is, however, no direct experimental evidence supporting this hypothesis aside from the support from computational predictions, such as the predictions by ACD/pKa Classic. There is always the possibility that the macroscopic \(\text {p}K_{\text{a}}\) observed is the result of a transition between mixtures of tautomers with similar energy, so that no single tautomer is dominant. We did not want to bias the blind challenge analysis with any prediction method. Therefore, we believe analyzing the microscopic \(\text {p}K_{\text{a}}\) predictions via Hungarian matching to experimental values with the assumption that the 18 molecules have a single titratable site is not the best approach. Instead, an analysis at the level of macroscopic \(\text {p}K_{\text{a}}\) values is much more appropriate when a numerical matching scheme is the only option to evaluate predictions using macroscopic experimental data. However, it should be noted that we often do not know the number of protons on the two forms of the molecule connected by a titration, either in the experiment or in the calculated \(\text {p}K_{\text{a}}\)s, so a match in values may be accidental.

For a subset of eight molecules, dominant microstates were inferred from NMR experiments. Six of these molecules were monoprotic and two were multiprotic. This dataset was extremely useful for guiding the assignment between experimental and predicted \(\text {p}K_{\text{a}}\) values based on microstates. In this section, we present the performance evaluations of microscopic \(\text {p}K_{\text{a}}\) predictions for only the 8 compounds with experimentally-determined dominant microstates.

Fig. 8

NMR determination of dominant microstates allowed in-depth evaluation of microscopic \(\text {p}K_{\text{a}}\) predictions for 8 compounds. a Dominant microstate sequences of two compounds (SM07 and SM14) were determined by NMR [8]. Based on these reference compounds, the dominant microstates of 6 related compounds were inferred and experimental \(\text {p}K_{\text{a}}\) values were assigned to titratable groups with the assumption that only the dominant microstates have significant contributions to the experimentally observed \(\text {p}K_{\text{a}}\). b RMSE vs. submission ID and unmatched \(\text {p}K_{\text{a}}\) vs. submission ID plots for the evaluation of microscopic \(\text {p}K_{\text{a}}\) predictions of 8 molecules by Hungarian matching to experimental macroscopic \(\text {p}K_{\text{a}}\) values. c RMSE vs. submission ID and unmatched \(\text {p}K_{\text{a}}\) vs. submission ID plots showing the evaluation of microscopic \(\text {p}K_{\text{a}}\) predictions of 8 molecules by microstate-based matching between predicted microscopic \(\text {p}K_{\text{a}}\)s and experimental macroscopic \(\text {p}K_{\text{a}}\) values. Submissions 0wfzo, z3btx, 758j8, and hgn83 have RMSE values larger than 10 \(\text {p}K_{\text{a}}\) units, which are beyond the y-axis limits of subplots b and c. RMSE is shown with error bars denoting 95% confidence intervals obtained by bootstrapping over the challenge molecules. Lower bar plots show the number of unmatched experimental \(\text {p}K_{\text{a}}\)s (light grey, missing predictions) and the number of unmatched \(\text {p}K_{\text{a}}\) predictions (dark grey, extra predictions) for each method between pH 2 and 12. Submission IDs are summarized in Table 1

Microstate-based matching revealed errors masked by \(\text {p}K_{\text{a}}\) value-based matching between experimental and predicted \(\text {p}K_{\text{a}}\)s

Comparing microscopic \(\text {p}K_{\text{a}}\) predictions directly to macroscopic experimental \(\text {p}K_{\text{a}}\) values with numerical matching can lead to underestimation of errors. To demonstrate how numerical matching often masks \(\text {p}K_{\text{a}}\) prediction errors, we compared the performance analysis done by Hungarian matching to that from microstate-based matching for the 8 molecules presented in Fig. 8a. RMSE calculated for microscopic \(\text {p}K_{\text{a}}\) predictions matched to experimental values via Hungarian matching is shown in Fig. 8b, while Fig. 8c shows RMSE calculated via microstate-based matching. Hungarian matching leads to significantly (and artificially) lower RMSE compared to microstate-based matching. The reason is that Hungarian matching assigns experimental \(\text {p}K_{\text{a}}\) values to predicted \(\text {p}K_{\text{a}}\) values based only on the closeness of the numerical values, without consideration of the relative population of microstates and microstate identities. Because of this, a microscopic \(\text {p}K_{\text{a}}\) value that describes a transition between very low population microstates (high energy tautomers) can be assigned to the experimental \(\text {p}K_{\text{a}}\) if it has the closest \(\text {p}K_{\text{a}}\) value. This is not helpful because, in reality, the microscopic \(\text {p}K_{\text{a}}\) values that influence the observable macroscopic \(\text {p}K_{\text{a}}\) the most are the ones with higher microstate populations (transitions between low energy tautomers).
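The masking effect described above can be sketched with a toy example. The microstate labels and \(\text {p}K_{\text{a}}\) values below are hypothetical, and the numerical matching is done by brute-force search over pairings rather than by the Hungarian algorithm itself; for such small sets both find the same minimum-cost assignment. The sketch assumes at least as many predictions as experimental values.

```python
from itertools import permutations

# Hypothetical data for one molecule; labels and values are invented for illustration.
experimental = [{"pair": ("cation_A", "neutral_A"), "pka": 5.0}]
predicted = [
    {"pair": ("cation_A", "neutral_A"), "pka": 7.1},  # transition between dominant tautomers
    {"pair": ("cation_B", "neutral_B"), "pka": 5.2},  # transition between minor tautomers
]

def numeric_match(exp, pred):
    """Pair each experimental value with a distinct prediction, minimizing total |error|."""
    best = min(
        permutations(range(len(pred)), len(exp)),
        key=lambda combo: sum(abs(exp[i]["pka"] - pred[j]["pka"]) for i, j in enumerate(combo)),
    )
    return [(exp[i], pred[j]) for i, j in enumerate(best)]

def microstate_match(exp, pred):
    """Pair values only when they describe the same microstate transition."""
    by_pair = {p["pair"]: p for p in pred}
    return [(e, by_pair[e["pair"]]) for e in exp if e["pair"] in by_pair]

def mean_abs_error(matches):
    return sum(abs(e["pka"] - p["pka"]) for e, p in matches) / len(matches)

numeric_matches = numeric_match(experimental, predicted)
numeric_mae = mean_abs_error(numeric_matches)
microstate_mae = mean_abs_error(microstate_match(experimental, predicted))
print(f"numerical matching MAE:  {numeric_mae:.1f}")
print(f"microstate matching MAE: {microstate_mae:.1f}")
```

In this toy case the numerical match selects the prediction for the minor tautomer pair because its value happens to be closest, which hides the larger error of the prediction for the experimentally relevant transition.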

The number of unmatched predicted microscopic \(\text {p}K_{\text{a}}\)s is shown in the lower bar plots of Fig. 8b and c, to emphasize the large number of microscopic \(\text {p}K_{\text{a}}\) predictions submitted by many methods. In the case of microscopic \(\text {p}K_{\text{a}}\), the number of unmatched predictions does not indicate an error in the form of an extra predicted \(\text {p}K_{\text{a}}\), because the spectrophotometric experiments do not capture all microscopic \(\text {p}K_{\text{a}}\)s theoretically possible (transitions between all pairs of microstates that differ by one proton). \(\text {p}K_{\text{a}}\)s of transitions to and from very high energy tautomers are very hard to measure by experimental methods, including the most sensitive methods like NMR. Prediction of extra microscopic \(\text {p}K_{\text{a}}\) values can cause underestimation of prediction errors when numerical matching algorithms such as Hungarian matching are used. We also checked how often Hungarian matching led to the correct matches between predicted and experimental \(\text {p}K_{\text{a}}\) in terms of the microstate pairs, i.e., how often the microstate pair of the Hungarian match recapitulates the dominant microstate pair of the experiment. The overall accuracy of microstate pair matching was found to be low for the SAMPL6 Challenge submissions. Fig. S4 shows that for most methods the predicted microstate pair selected by the Hungarian match did not correspond to the experimentally-determined microstate pair. This means that the lower RMSE (better accuracy) statistics obtained from Hungarian matching are artificially low. This problem could be avoided by matching experimental and predicted values on the basis of microstate IDs, if experimental microscopic assignments are available.

Unfortunately, we were only able to perform this more reliable microstate-based analysis for a subset of compounds. The conclusions in this section reflect only eight compounds with limited structural diversity: six molecules with 4-aminoquinazoline and two with benzimidazole scaffolds, with a total of 10 \(\text {p}K_{\text{a}}\) values. The sequences of dominant microstates for SM07 and SM14 were determined by NMR experiments directly [8], while dominant microstates of their derivatives were inferred by taking them as a reference (Fig. 8). Although we believe that microstate-based evaluation is more informative, the lack of a large experimental dataset limits the conclusions to a very narrow chemical diversity. Still, microstate-based matching revealed errors masked by \(\text {p}K_{\text{a}}\) value-based matching between experimental and predicted \(\text {p}K_{\text{a}}\)s.

Accuracy of \(\text {p}K_{\text{a}}\) predictions evaluated by microstate-based matching

Fig. 9

Additional performance statistics for microscopic \(\text {p}K_{\text{a}}\) predictions for 8 molecules with experimentally determined dominant microstates. Microstate-based matching was performed between experimental \(\text {p}K_{\text{a}}\) values and predicted microscopic \(\text {p}K_{\text{a}}\)  values. Mean absolute error (MAE), mean error (ME), Pearson’s R2, and Kendall’s Rank Correlation Coefficient Tau (\(\tau \)) are shown, with error bars denoting 95% confidence intervals obtained by bootstrapping over challenge molecules. Methods are indicated by their submission IDs. Submissions are colored by their method categories. Refer to Table 1 for submission IDs and method names. Submissions 0wfzo, z3btx, 758j8, and hgn83 have MAE and ME values bigger than 10 \(\text {p}K_{\text{a}}\) units which are beyond the y-axis limits of subplots a and b. A large number and wide variety of methods have statistically indistinguishable performance based on correlation statistics (c and d), in part because of the relatively small dynamic range and small size of the set of 8 molecules

Both accuracy- and correlation-based statistics were calculated for the predicted microscopic \(\text {p}K_{\text{a}}\) values after microstate-based matching. RMSE, MAE, ME, R2, and Kendall’s Tau results of each method are shown in Figs. 8c and 9. A table of the calculated statistics can be found in Table S4. Due to the small number of data points in this set, correlation-based statistics have large uncertainties and thus have less utility for distinguishing better-performing methods. Therefore, we focused more on accuracy-based metrics than on correlation-based metrics for the analysis of microscopic \(\text {p}K_{\text{a}}\)s. In terms of accuracy of predicted microscopic \(\text {p}K_{\text{a}}\) values, all three QSPR/ML based methods (nb016 (MoKa), hdiyq (Simulations Plus), 6tvf8 (OE Gaussian Process)), three QM-based methods (nb011 (Jaguar), ftc8w (EC-RISM/MP2/cc-pVTZ-P2-q-noThiols-2par), t8ewk (COSMOlogic_FINE17)), and one LFER method (v8qph (ACD/pKa GALAS)) achieved RMSE lower than 1 \(\text {p}K_{\text{a}}\) unit. The same seven methods also have the lowest MAE.
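As an illustration of how accuracy statistics with bootstrapped confidence intervals can be obtained, the following is a minimal sketch. It assumes a simple percentile bootstrap that resamples molecules with replacement; the challenge analysis may differ in detail, and the matched values and molecule labels below are hypothetical.

```python
import math
import random

def accuracy_stats(pairs):
    """RMSE, MAE, and ME for a list of (predicted, experimental) pKa pairs."""
    errors = [pred - exp for pred, exp in pairs]
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "ME": sum(errors) / n,
    }

def bootstrap_ci(per_molecule, stat, n_boot=2000, seed=0):
    """95% percentile interval obtained by resampling molecules with replacement."""
    rng = random.Random(seed)
    molecules = list(per_molecule)
    samples = []
    for _ in range(n_boot):
        drawn = [rng.choice(molecules) for _ in molecules]
        pairs = [pair for mol in drawn for pair in per_molecule[mol]]
        samples.append(accuracy_stats(pairs)[stat])
    samples.sort()
    return samples[int(0.025 * n_boot)], samples[int(0.975 * n_boot) - 1]

# Hypothetical matched (predicted, experimental) pKa values, grouped by molecule.
matched = {
    "mol_1": [(4.5, 4.0)],
    "mol_2": [(6.0, 7.0)],
    "mol_3": [(3.0, 2.5), (8.0, 9.0)],  # a multiprotic molecule keeps its pKa values together
}

all_pairs = [pair for pairs in matched.values() for pair in pairs]
point = accuracy_stats(all_pairs)
rmse_low, rmse_high = bootstrap_ci(matched, "RMSE")
print(point, (rmse_low, rmse_high))
```

Resampling whole molecules rather than individual \(\text {p}K_{\text{a}}\) values keeps the values of a multiprotic molecule together in each bootstrap sample.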

Evaluation of dominant microstate prediction accuracy

For many computational chemistry approaches, including structure-based modeling of protein–ligand interactions, predicting the ionization state and the exact position of protons is necessary to establish what to include in the modeled system. In addition to being able to predict \(\text {p}K_{\text{a}}\) values accurately, we require \(\text {p}K_{\text{a}}\) prediction methods to be able to capture microscopic protonation states accurately. Even when the predicted \(\text {p}K_{\text{a}}\) value is accurate, the predicted protonation sites can be incorrect, leading to potentially large modeling errors in quantities such as the computed free energy of binding. Therefore, we assessed whether methods participating in the SAMPL6 \(\text {p}K_{\text{a}}\) Challenge were correctly predicting the sequence of dominant microstates, i.e., dominant tautomers of each charge state observed between pH 2 and 12.

Figure 10 shows how well methods perform for predicting the dominant microstate, as analyzed for eight compounds with available experimental microstate assignments. The dominant microstate sequence is essentially the sequence of states that are most visible experimentally due to their higher fractional population and lower relative free energy among the tautomers at each charge. To extract the dominant tautomers predicted by each method for the sequence of ionization states, the relative free energies of microstates were first calculated at reference pH 0 [26]. To subsequently determine the dominant microstate at each formal charge, we selected the lowest energy tautomer for each ionization state based on the relative microstate free energies calculated at pH 0. The choice of reference pH is arbitrary, as the relative free energy difference between tautomers of the same charge is constant with respect to pH. This analysis was performed only for the charges −1, 0, +1, and +2, the charge range captured by NMR experiments. Predicted and experimental dominant microstates were then compared for each charge state to calculate the fraction of correctly predicted dominant tautomers. This value is reported as the dominant microstate accuracy for all charge states (Fig. 10a).
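The selection procedure described above can be sketched as follows. The microstate IDs, free energies, and experimental assignments are hypothetical, and the pH dependence used here (a shift of \(RT \ln 10\) per pH unit for each proton relative to the reference state) is an assumed standard form rather than a transcription of the challenge analysis code.

```python
RT_LN10 = 1.364  # kcal/mol per pH unit near 298 K (assumed temperature)

def free_energy_at_ph(g_ph0, n_protons, ph):
    """Relative free energy at a given pH, shifted from its value at pH 0.

    n_protons is the number of protons relative to the reference state.
    """
    return g_ph0 + n_protons * RT_LN10 * ph

def dominant_microstates(microstates, ph=0.0):
    """Return {charge: microstate ID} for the lowest free energy tautomer at each charge."""
    best = {}
    for ms in microstates:
        g = free_energy_at_ph(ms["g_ph0"], ms["n_protons"], ph)
        if ms["charge"] not in best or g < best[ms["charge"]][1]:
            best[ms["charge"]] = (ms["id"], g)
    return {charge: entry[0] for charge, entry in best.items()}

# Hypothetical predicted microstates for one molecule (free energies in kcal/mol).
predicted = [
    {"id": "ms01", "charge": 0, "n_protons": 0, "g_ph0": 0.0},
    {"id": "ms02", "charge": 0, "n_protons": 0, "g_ph0": 2.1},
    {"id": "ms03", "charge": 1, "n_protons": 1, "g_ph0": -6.5},
    {"id": "ms04", "charge": 1, "n_protons": 1, "g_ph0": -7.9},
]
# Hypothetical experimentally determined dominant microstates.
experimental = {0: "ms01", 1: "ms03"}

dominant = dominant_microstates(predicted)
accuracy = sum(dominant.get(c) == ms for c, ms in experimental.items()) / len(experimental)
print(dominant, accuracy)
```

Tautomers of the same charge carry the same number of protons, so the pH-dependent term shifts them equally and the selected dominant tautomer does not depend on the reference pH.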

Many of the methods that participated in the challenge made errors in predicting the dominant microstate. Ten QM and three QSPR/ML methods did not make any mistakes in dominant microstate predictions, although they are expected to make mistakes in the relative population of tautomers (free energy difference between microstates), as reflected by the \(\text {p}K_{\text{a}}\) value errors. While all participating QSPR/ML methods showed good performance in dominant microstate prediction, LFER and some QM methods made mistakes. The accuracy of the predicted dominant neutral tautomers was perfect for all methods except qsicn (Fig. 10b), but errors in predicting the major tautomer of charge +1 were much more frequent: 22 out of 35 prediction sets made at least one error in predicting the lowest energy tautomer with +1 charge. We did not include ionization states with charges −1 and +2 in this assessment because we had only one compound with these charges in the dataset. Nevertheless, errors in predicting the dominant tautomers seem to be a bigger problem for charged tautomers than for the neutral tautomer.

Only eight compounds had data on the sequence of dominant microstates. Therefore, conclusions on the performance of methods in terms of dominant tautomer prediction are restricted to this narrow chemical diversity (benzimidazole and 4-aminoquinazoline derivatives). We present this analysis as a prototype of how microscopic \(\text {p}K_{\text{a}}\) predictions should be evaluated. Hopefully, future evaluations can be performed with larger experimental datasets following the strategy we demonstrated here in order to reach broad conclusions about which methods are better for capturing dominant microstates and ratios of tautomers. Even if experimental microscopic \(\text {p}K_{\text{a}}\) measurement data are not available, experimental dominant tautomer determinations are still informative for assessing computational predictions.

Fig. 10

Some methods predicted the sequence of dominant tautomers inaccurately. Prediction accuracy of the dominant microstate of each charged state was calculated using the dominant microstate sequence determined by NMR for 8 molecules as reference. a Dominant microstate accuracy vs. submission ID plot was calculated considering all the dominant microstates seen in the experimental microstate dataset of 8 molecules. b Dominant microstate accuracy vs. submission ID plot was generated considering only the dominant microstates of charge 0 and +1 seen in the 8 molecule dataset. The accuracy of each molecule is broken out by the total charge of the microstate. c Dominant microstate prediction accuracy calculated for each molecule averaged over all methods. In b and c, the accuracy of predicting the dominant neutral tautomer is shown in blue and the accuracy of predicting the dominant +1 charged tautomer is shown in green. Error bars denote 95% confidence intervals obtained by bootstrapping. d Depiction of SM14 microstates for protonation states with +2, +1, and 0 charges. The dominant tautomer of each macroscopic protonation state is highlighted with a rectangle. Dominant microstates of each charge were determined based on NMR experiments

The most frequent misprediction was the major tautomer of the SM14 cationic form, as shown in Fig. 10c. This figure shows the accuracy of the predicted dominant microstate calculated for individual molecules and for charge states 0 and +1, averaged over all prediction methods. SM14, the molecule that exhibits the most frequent error in the predicted dominant microstate, has two experimental \(\text {p}K_{\text{a}}\) values that are 2.4 \(\text {p}K_{\text{a}}\) units apart, and we suspect that this could contribute to the difficulty of predicting microstates accurately. Other molecules are monoprotic (4-aminoquinazolines) or their experimental \(\text {p}K_{\text{a}}\) values are very well separated (SM15, 4.2 \(\text {p}K_{\text{a}}\) units). It would be very interesting to expand this assessment to a larger variety of drug-like molecules to discover for which structures tautomer predictions are more accurate and for which structures computational predictions are not as reliable.

Consistently well-performing methods for microscopic \(\text {p}K_{\text{a}}\) predictions

We have identified different criteria for determining consistently top-performing predictions of microscopic \(\text {p}K_{\text{a}}\) than for macroscopic \(\text {p}K_{\text{a}}\): having perfect dominant microstate prediction accuracy, unmatched \(\text {p}K_{\text{a}}\) count of 0, and ranking in the top 10 according to RMSE and MAE. Correlation statistics were not found to have utility for discriminating performance due to large uncertainties in these statistics for a small dataset of 10 \(\text {p}K_{\text{a}}\) values. Unmatched predicted \(\text {p}K_{\text{a}}\) count was also not considered, since experimental data were only informative for the \(\text {p}K_{\text{a}}\) between dominant microstates and did not capture all the possible theoretical transitions between microstate pairs. Table 3 reports six methods that perform consistently well according to many metrics, although evaluated only for the 8 molecule set due to limitations of the experimental dataset. These six methods were divided evenly between the QSPR/ML category and the QM category. nb016 (MoKa), hdiyq (Simulations Plus), and 6tvf8 (OE Gaussian Process) were QSPR and ML methods that performed well. nb011 (Jaguar), 0xi4b (EC-RISM/B3LYP/6-311+G(d,p)-P2-phi-noThiols-2par), and cywyk (EC-RISM/B3LYP/6-311+G(d,p)-P2-phi-noThiols-2par) were QM predictions with linear empirical corrections that performed well for microscopic \(\text {p}K_{\text{a}}\) predictions.

The Simulations Plus \(\text {p}K_{\text{a}}\) prediction method is the only method that appeared to be consistently well-performing in the assessments of both macroscopic and microscopic \(\text {p}K_{\text{a}}\) prediction (gyuhx and hdiyq). However, it is worth noting that two methods that were in the list of consistently top-performing methods for macroscopic \(\text {p}K_{\text{a}}\) predictions lacked equivalent submissions of their underlying microscopic \(\text {p}K_{\text{a}}\) predictions, and therefore could not be evaluated at the microstate level. These methods were xmyhm (ACD/pKa Classic) and xvxzd (DSD-BLYP-D3(BJ)/def2-TZVPD//PBEh-3c[DCOSMO-RS] + RRHO(GFN-xTB[GBSA]) + Gsolv(COSMO-RS[TZVPD]) and linear fit).

Table 3 Top-performing methods for microscopic \(\text {p}K_{\text{a}}\) predictions based on consistent ranking within the Top 10 according to various statistical metrics calculated for 8 molecule dataset

How do \(\text {p}K_{\text{a}}\) prediction errors impact protein–ligand binding affinity predictions?

\(\text {p}K_{\text{a}}\) predictions provide a key input for computational modeling of protein–ligand binding with physical methods. The SAMPL6 \(\text {p}K_{\text{a}}\) Challenge focused only on small molecule \(\text {p}K_{\text{a}}\) prediction and showed how the observed \(\text {p}K_{\text{a}}\) prediction accuracy can impact the modeling of ligands. Many affinity prediction methods, such as docking, MM/PBSA, MM/GBSA, and absolute or alchemical relative free energy calculations, predict the affinity of the ligand to a receptor using a fixed protonation state for both ligand and receptor. These models can depend sensitively on \(\text {p}K_{\text{a}}\) and dominant tautomer predictions for determining possible protonation states of the ligand in the aqueous environment and in a protein complex, as well as the free energy penalty to access those states [4]. The accuracy of \(\text {p}K_{\text{a}}\) predictions can become a limitation for the performance of physical models that try to quantitatively describe molecular association.

In terms of ligand protonation states, there are two ways in which \(\text {p}K_{\text{a}}\) prediction errors can influence the prediction accuracy for protein–ligand binding free energies, as depicted in Fig. 11. The first scenario is when a ligand is present in aqueous solution in multiple protonation states (Fig. 11a). When only the minor aqueous protonation state contributes to protein–ligand complex formation, the overall binding free energy (\(\Delta G_{bind}\)) needs to be calculated as the sum of the binding free energy of the minor state and the protonation penalty of that state (\(\Delta G_{prot}\)). \(\Delta G_{prot}\) is a function of both pH and \(\text {p}K_{\text{a}}\). An error of 1 unit in predicted \(\text {p}K_{\text{a}}\) would lead to a 1.36 kcal/mol error in overall binding free energy if the protonation state with the minor population binds the protein and this minor protonation state is correctly selected to model the free energy of binding. If the incorrect dominant protonation state for the complex is selected, the dominant contribution to the free energy of binding may be missed entirely, leading to much larger modeling errors in the binding free energy. In other cases, where multiple protonation states can be significantly populated in the complex, errors in predicted \(\text {p}K_{\text{a}}\) propagate in more complicated ways. The equations in Fig. 11a show the overall free energy for a simple thermodynamic cycle involving multiple protonation states.
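The 1.36 kcal/mol figure corresponds to \(RT \ln 10\) near room temperature. A minimal sketch is given below, assuming a monoprotic base whose neutral (minor) form binds, a temperature of 298.15 K, and a protonation penalty written as the free energy cost of selecting the neutral form from the aqueous ensemble. The function name and example values are illustrative and are not taken from Fig. 11a.

```python
import math

R = 1.987204e-3  # gas constant in kcal/(mol K)
T = 298.15       # assumed temperature in K

def protonation_penalty_neutral_base(pka, ph):
    """Free energy cost (kcal/mol) of populating the neutral form of a monoprotic base."""
    return R * T * math.log(1.0 + 10.0 ** (pka - ph))

per_pka_unit = R * T * math.log(10.0)  # limiting slope when pKa is well above pH

# Effect of a 1 unit pKa error for a hypothetical base at pH 7.4.
penalty_true = protonation_penalty_neutral_base(10.4, 7.4)
penalty_predicted = protonation_penalty_neutral_base(9.4, 7.4)
penalty_error = penalty_true - penalty_predicted
print(f"RT ln(10) = {per_pka_unit:.2f} kcal/mol")
print(f"penalty error for a 1 unit pKa error = {penalty_error:.2f} kcal/mol")
```

When the \(\text {p}K_{\text{a}}\) is close to the pH the penalty is no longer linear in \(\text {p}K_{\text{a}}\), and the error from a 1 unit shift is smaller than this limiting value.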

In addition to the presence of multiple protonation states in the aqueous environment, multiple charge states can contribute to complex formation (Fig. 11b). Then, the overall free energy of binding needs to include a Multiple Protonation States Correction (MPSC) term (\(\Delta G_{corr}\)) [4]. MPSC is a function of pH, aqueous \(\text {p}K_{\text{a}}\) of the ligand, and the difference between the binding free energy of charged and neutral species (\(\Delta G_{bind}^{C} - \Delta G_{bind}^{N}\)) as shown in Fig. 11b.

Fig. 11

Aqueous ligand \(\text {p}K_{\text{a}}\) can influence overall protein–ligand binding affinity. a When only the minor aqueous protonation state contributes to protein–ligand complex formation, the overall binding free energy (\(\Delta G_{bind}\)) needs to be calculated as the sum of binding affinity of the minor state and the protonation penalty of that state. b When multiple charge states contribute to complex formation, the overall free energy of binding includes a multiple protonation states correction (MPSC) term (\(\Delta G_{corr}\)). MPSC is a function of pH, aqueous \(\text {p}K_{\text{a}}\) of the ligand, and the difference between the binding free energy of charged and neutral species (\(\Delta G_{bind}^{C} - \Delta G_{bind}^{N}\))

Using the equations in Fig. 11b, we can model the true MPSC (\(\Delta G_{corr}\)) with respect to the difference between pH and the \(\text {p}K_{\text{a}}\) of the ligand to see when this value has a significant impact on the overall binding free energy. In Fig. 12, the true MPSC that must be added to \(\Delta G_{bind}^{N}\) is shown for ligands with varying binding affinity difference between protonation states (\(\Delta \Delta G = \Delta G_{bind}^{C} - \Delta G_{bind}^{N}\)). Fig. 12a shows the case of a monoprotic base in which the charged state has a lower affinity than the neutral state. Solid lines depict the accurate correction value. In cases where the \(\text {p}K_{\text{a}}\) is lower than the pH, the correction factor disappears as the ligand fully populates the neutral state (\(\Delta G_{bind} = \Delta G_{bind}^{N}\)). As the pH dips below the \(\text {p}K_{\text{a}}\), the charged state is increasingly populated and \(\Delta G_{corr}\) increases to approach \(\Delta \Delta G\).
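The limiting behavior described above follows from a two-state binding model. As a sketch, and as an assumption that is consistent with the stated limits rather than a transcription of the equations in Fig. 11b, let \(r = 10^{\text {p}K_{\text{a}} - \text{pH}}\) denote the aqueous ratio of charged to neutral populations of a monoprotic base. Weighting the binding constants of the two states by their aqueous populations gives

\[
\Delta G_{bind} = \Delta G_{bind}^{N} + \Delta G_{corr}, \qquad
\Delta G_{corr} = -RT \ln \frac{1 + r\, e^{-\Delta \Delta G / RT}}{1 + r}.
\]

When \(\text {p}K_{\text{a}} \ll \text{pH}\), \(r \rightarrow 0\) and \(\Delta G_{corr} \rightarrow 0\). When \(\text {p}K_{\text{a}} \gg \text{pH}\), \(r \rightarrow \infty \) and \(\Delta G_{corr} \rightarrow \Delta \Delta G\), so that \(\Delta G_{bind} \rightarrow \Delta G_{bind}^{C}\).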

It is interesting to note the pH − \(\text {p}K_{\text{a}}\) range over which \(\Delta G_{corr}\) changes significantly. It is often assumed that, for a basic ligand, if the \(\text {p}K_{\text{a}}\) is more than 2 units higher than the pH, only 1% of the population is in the neutral state according to the Henderson-Hasselbalch equation, and it is safe to approximate the overall binding affinity with \(\Delta G_{bind}^{C}\). Depending on the magnitude of the relative free energy difference between ligand protonation states, this assumption is not always correct. As seen in Fig. 12a, the responsive region of \(\Delta G_{corr}\) can span 3 pH units for a system with \(\Delta \Delta G = 1 {\text{ kcal/mol}}\), or 5 pH units for a system with \(\Delta \Delta G = 4 {\text{ kcal/mol}}\). This highlights that the range of \(\text {p}K_{\text{a}}\) values that impact binding affinity predictions is wider than 2 pH units. Molecules with \(\text {p}K_{\text{a}}\) values several units away from the physiological pH can still impact the overall binding affinity significantly due to the MPSC.
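To make this point concrete, the sketch below evaluates the neutral fraction and the gap between \(\Delta \Delta G\) and \(\Delta G_{corr}\) for a base whose \(\text {p}K_{\text{a}}\) is 2 units above the pH. It assumes a two-state model for a monoprotic base at 298.15 K; the functional form of the correction is an assumption consistent with the limits described in the text, not a transcription of Fig. 11b.

```python
import math

RT = 1.987204e-3 * 298.15  # kcal/mol at an assumed temperature of 298.15 K

def fraction_neutral_base(pka, ph):
    """Henderson-Hasselbalch neutral fraction of a monoprotic base."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))

def mpsc_base(pka, ph, ddg):
    """Assumed two-state MPSC (kcal/mol) added to the neutral-state binding free energy."""
    r = 10.0 ** (pka - ph)  # charged / neutral population ratio in water
    return -RT * math.log((1.0 + math.exp(-ddg / RT) * r) / (1.0 + r))

ph, pka = 7.4, 9.4  # pKa is 2 units above the pH
neutral_fraction = fraction_neutral_base(pka, ph)

# Residual error from approximating the overall affinity by the charged-state affinity.
residuals = {ddg: ddg - mpsc_base(pka, ph, ddg) for ddg in (1.0, 4.0)}
print(f"neutral fraction: {neutral_fraction:.3f}")
for ddg, residual in residuals.items():
    print(f"ddG = {ddg} kcal/mol -> residual = {residual:.2f} kcal/mol")
```

Under these assumptions the residual is negligible when \(\Delta \Delta G\) is small but exceeds 1 kcal/mol when \(\Delta \Delta G = 4\) kcal/mol, even though only about 1% of the ligand is neutral in solution.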

Despite the need to capture the contributions of multiple protonation states by including the MPSC in binding affinity calculations, inaccurate \(\text {p}K_{\text{a}}\) predictions can lead to errors in \(\Delta G_{corr}\) and in the overall predicted free energy of binding. In Fig. 12a, dashed lines show the predicted \(\Delta G_{corr}\) based on a \(\text {p}K_{\text{a}}\) error of −1 unit. We have chosen a \(\text {p}K_{\text{a}}\) error of 1 unit as this is the average inaccuracy expected from the \(\text {p}K_{\text{a}}\) prediction methods based on the SAMPL6 Challenge. Underestimation of the \(\text {p}K_{\text{a}}\) causes the \(\Delta G_{corr}\) to be underestimated as well and will result in overestimated affinities (i.e., too negative binding free energy) for a varying range of pH − \(\text {p}K_{\text{a}}\) values depending on the binding affinity difference between protonation states (\(\Delta \Delta G\)). In Fig. 12b, dashed lines show how the magnitude of the absolute error caused by calculating \(\Delta G_{corr}\) with an inaccurate \(\text {p}K_{\text{a}}\) varies with respect to pH. Different colored lines show simulated results with varying binding free energy differences between protonation states. For a system whose charged state has a lower affinity (more positive binding free energy) than the neutral state (\(\Delta \Delta G\) = 2 kcal/mol), the absolute error caused by underestimating the \(\text {p}K_{\text{a}}\) by 1 unit can be up to 0.9 kcal/mol. For a system whose charged state has an even lower affinity than the neutral state (\(\Delta \Delta G\) = 4 kcal/mol), the absolute error caused by underestimating the \(\text {p}K_{\text{a}}\) by 1 unit can be up to 1.2 kcal/mol. The magnitude of errors contributing to overall binding affinity is too large to be neglected. Improving the accuracy of small molecule \(\text {p}K_{\text{a}}\) prediction methods can help to minimize the error in predicted MPSC.

With the current level of \(\text {p}K_{\text{a}}\) prediction accuracy as observed in SAMPL6 Challenge, is it advantageous to include the MPSC in affinity predictions that may include errors caused by \(\text {p}K_{\text{a}}\) predictions? We provide a comparison of the two choices to answer this question: (1) Neglecting the MPSC completely and assuming overall binding affinity is captured by \(\Delta G_{bind}^{N}\), (2) including MPSC with a potential error in overall affinity calculation. The magnitude of error caused by Choice 1 (ignoring MPSC) is depicted as a solid line in Fig. 12b and the magnitude of error caused by MPSC computed with inaccurate \(\text {p}K_{\text{a}}\) is depicted as dashed lines. What is the best strategy? Error due to choice 1 is always larger than error due to choice 2 for all pH–\(\text {p}K_{\text{a}}\) values. In this scenario, including the MPSC improves overall binding affinity prediction accuracy. The error caused by the inaccurate \(\text {p}K_{\text{a}}\) is smaller than the error caused by neglecting the MPSC.
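The comparison of the two choices can be reproduced approximately with a short script. This is a sketch under stated assumptions, not the analysis code behind Fig. 12: it uses a two-state form of the correction written in terms of the aqueous \(\text {p}K_{\text{a}}\) and \(\Delta \Delta G\) (assumed equivalent to the equations in Fig. 11b), and the helper names `mpsc` and `errors` are illustrative.

```python
import math

RT = 1.987204e-3 * 298.15  # kcal/mol at 25 °C


def mpsc(ph_minus_pka, ddg):
    """dG_corr (kcal/mol) for a monoprotic base with ddG = dG_bind^C - dG_bind^N."""
    x = 10.0 ** (-ph_minus_pka)   # aqueous ratio [C]/[N]
    a = math.exp(-ddg / RT)       # ratio of binding constants K_C / K_N
    return -RT * math.log((1.0 + a * x) / (1.0 + x))


def errors(ph_minus_pka, ddg, pka_error):
    """Return (error of Choice 1, error of Choice 2) in kcal/mol.

    Choice 1 ignores the MPSC; Choice 2 uses an MPSC computed from a
    predicted pKa equal to the true pKa plus pka_error.
    """
    true_corr = mpsc(ph_minus_pka, ddg)
    # Shifting the pKa by pka_error shifts pH - pKa by -pka_error.
    pred_corr = mpsc(ph_minus_pka - pka_error, ddg)
    return abs(true_corr), abs(pred_corr - true_corr)


grid = [i / 100.0 for i in range(-800, 401)]  # pH - pKa from -8 to 4
for ddg in (2.0, 4.0):
    worst = max(errors(d, ddg, -1.0)[1] for d in grid)
    print(f"ddG = {ddg} kcal/mol: largest Choice 2 error = {worst:.2f} kcal/mol")
```

With `pka_error = -1.0` and positive \(\Delta \Delta G\), the first element of the returned pair is never smaller than the second on this grid, which is the behavior described for Fig. 12b. Passing `pka_error = +1.0` instead gives a window in which the ordering reverses.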

We can also ask whether or not an MPSC calculated based on an inaccurate \(\text {p}K_{\text{a}}\) should be included in binding affinity predictions in different circumstances, such as underestimated or overestimated \(\text {p}K_{\text{a}}\) values and charged states with higher or lower affinities than the neutral states. We tried to capture these circumstances in four quadrants of Fig. 12. In the case of overestimated \(\text {p}K_{\text{a}}\) values (Fig. 12e–h), it can be seen that for most of the pH–\(\text {p}K_{\text{a}}\) range, it is more advantageous to include the predicted MPSC in affinity calculations, except for a smaller window where the opposite choice would be more advantageous. For instance, for the system with \(\Delta \Delta G\) = 2 kcal/mol and overestimated \(\text {p}K_{\text{a}}\) (Fig. 12e), for the pH–\(\text {p}K_{\text{a}}\) region between −0.5 and 2, including the predicted \(\Delta G_{corr}\) introduces more error than ignoring the MPSC.

In practice, we normally do not know the exact magnitude or the direction of the error of our predicted \(\text {p}K_{\text{a}}\). Therefore, using simulated MPSC error plots to decide when to include the MPSC in binding affinity predictions is not possible. However, based on the analysis of a case with 1 unit of \(\text {p}K_{\text{a}}\) error, including the MPSC would, more often than not, help improve binding affinity predictions. The detrimental effect of \(\text {p}K_{\text{a}}\) inaccuracy is still significant. Hopefully, future improvements in \(\text {p}K_{\text{a}}\) prediction methods will improve the accuracy of the MPSC and binding affinity predictions of ligands which have multiple protonation states that contribute to aqueous or complex populations. Being able to predict \(\text {p}K_{\text{a}}\) values with 0.5 units accuracy, for example, would significantly aid binding affinity models in computing more accurate MPSC terms.

The whole analysis presented in this section assumes that at least the dominant protonation state of the ligand is correctly included in the modeling of the protein–ligand complex. We have not discussed the case of omitting this dominant state from the free energy calculations entirely when it is erroneously predicted to be a minor state in solution. Such a mistake could be the most problematic, and the errors in estimated binding free energy could be very large.

Fig. 12

Inaccuracy of \(\text {p}K_{\text{a}}\) prediction (± 1 unit) affects the accuracy of MPSC and overall protein–ligand binding free energy calculations to varying degrees based on aqueous \(\text {p}K_{\text{a}}\) and relative binding affinity of individual protonation states (\(\Delta \Delta G = \Delta G_{bind}^{C} - \Delta G_{bind}^{N}\)). All calculations are made for 25\(^{\circ }\)C, and a ligand with a single basic titratable group. a, c, e, and g MPSC (\(\Delta G_{corr}\)) calculated with true vs. inaccurate \(\text {p}K_{\text{a}}\). b, d, f, and h Comparison of the absolute error to \(\Delta G_{bind}\) caused by ignoring the MPSC completely (solid lines) vs. calculating the MPSC based on an inaccurate \(\text {p}K_{\text{a}}\) value (dashed lines). These plots provide guidance on when it is beneficial to include the MPSC based on \(\text {p}K_{\text{a}}\) error, pH–\(\text {p}K_{\text{a}}\), and \(\Delta \Delta G\).

Take-away lessons from SAMPL6 \(\text {p}K_{\text{a}}\) Challenge

The SAMPL6 \(\text {p}K_{\text{a}}\) Challenge showed that, in general, \(\text {p}K_{\text{a}}\) prediction accuracy of computational methods is lower than expected for drug-like molecules. Our expectation prior to the blind challenge was that well-developed methods would achieve prediction errors as low as 0.5 \(\text {p}K_{\text{a}}\) units, and make reliable predictions of dominant charge and tautomer states in solution. There are many factors that complicate predicting \(\text {p}K_{\text{a}}\) values of drug-like molecules: multiple titratable sites, including tautomerization, frequent presence of heterocycles, and extended conjugation patterns, as well as high numbers of rotatable bonds and the possibility of intramolecular hydrogen bonds. Macroscopic \(\text {p}K_{\text{a}}\) predictions have not yet reached experimental accuracy (where the inter-method variability of macroscopic \(\text {p}K_{\text{a}}\) measurements is around 0.5 \(\text {p}K_{\text{a}}\) units [23]). There was not a single method in the SAMPL6 Challenge that achieved an RMSE around 0.5 or lower for macroscopic \(\text {p}K_{\text{a}}\) predictions for the 24 molecule set of kinase inhibitor fragment-like molecules. Smaller RMSEs were observed in the microscopic \(\text {p}K_{\text{a}}\) evaluation section of this study for some methods; however, the 8 molecule set used for that analysis is too limited a dataset from which to draw conclusions about general expectations for drug-like molecules.

As the majority of experimental data was in the form of macroscopic \(\text {p}K_{\text{a}}\) values, we had to adopt a numerical matching algorithm (Hungarian matching) to pair predicted and experimental values to calculate performance statistics of macroscopic \(\text {p}K_{\text{a}}\) predictions. Accuracy, correlation, and extra/missing \(\text {p}K_{\text{a}}\) prediction counts were the main metrics for macroscopic \(\text {p}K_{\text{a}}\) evaluations. An RMSE range of 0.7 to 3.2 \(\text {p}K_{\text{a}}\) units was observed for all methods. Only five methods achieved RMSE between 0.7 and 1 \(\text {p}K_{\text{a}}\) units, while an RMSE between 1.5 and 3 log units was observed for the majority of methods. All four methods of the LFER category and three out of 5 QSPR/ML methods achieved RMSE less than 1.5 \(\text {p}K_{\text{a}}\) units. All the QM methods that achieved this level of performance included linear empirical corrections to rescale and unbias their \(\text {p}K_{\text{a}}\) predictions.

Based on the consideration of multiple error metrics, we compiled a shortlist of consistently well-performing methods for macroscopic \(\text {p}K_{\text{a}}\) evaluations. Two QM+LEC methods, one QSPR/ML method, and two empirical methods achieved consistent performance according to many metrics. The common features of the two empirical methods were their large training sets (16,000–17,000 compounds) and commercial nature.

There were four submissions of QM-based methods that utilized the COSMO-RS implicit solvation model. While three of these achieved the lowest RMSE among QM-based methods (xvxzd, yqkga, and 8xt50) [46], one of them showed the highest RMSE (0hxtm (COSMOtherm_FINE17)). The comparison of these methods indicates that capturing the conformational ensemble of microstates, using high-level QM calculations, and including RRHO corrections contribute to better macroscopic \(\text {p}K_{\text{a}}\) predictions. Linear empirical corrections applied to QM calculations improved results, especially when the linear correction was calibrated for an experimental dataset using the same level of theory as the deprotonation free energy predictions (as in xvxzd). This challenge also points to the advantage of the COSMO-RS solvation approach compared to other implicit solvent models.

Molecules that posed greater difficulty for \(\text {p}K_{\text{a}}\) predictions were determined by comparing the macroscopic \(\text {p}K_{\text{a}}\) prediction accuracy of each molecule averaged over all methods submitted to the challenge. \(\text {p}K_{\text{a}}\) prediction errors were higher for compounds with sulfur-containing heterocycles, iodo, and bromo groups. This trend was also conserved when only QM-based methods were analyzed. The SAMPL6 \(\text {p}K_{\text{a}}\) dataset consisted of only 24 small molecules, which limited our ability to statistically confirm this conclusion; however, we believe it is worth reporting molecular features that coincided with larger errors even if we cannot evaluate the reason for these failures.

Utilizing a numerical matching algorithm to pair experimental and predicted macroscopic \(\text {p}K_{\text{a}}\) values was a necessity; however, this approach did not capture all aspects of prediction errors. Computing the number of missing or extra \(\text {p}K_{\text{a}}\) predictions remaining after Hungarian matching provided a window for observing macroscopic \(\text {p}K_{\text{a}}\) prediction errors such as the number of macroscopic transitions or ionization states expected in a pH interval. In \(\text {p}K_{\text{a}}\) evaluation studies, it is typical to just focus on \(\text {p}K_{\text{a}}\) value errors evaluated after matching and to ignore \(\text {p}K_{\text{a}}\) prediction errors that the matching protocol cannot capture [50,51,52,53,54]. Frequently ignored prediction errors include predicting missing or extra \(\text {p}K_{\text{a}}\)s and failing to predict the correct charge states. The SAMPL6 \(\text {p}K_{\text{a}}\) Challenge results showed a sporadic presence of missing \(\text {p}K_{\text{a}}\) predictions and a very frequent tendency to make extra \(\text {p}K_{\text{a}}\) predictions. Both indicate failures to capture the correct ionization states. The traditional way of evaluating \(\text {p}K_{\text{a}}\)s that only focuses on the \(\text {p}K_{\text{a}}\) value error after some sort of numerical match between predictions and experimental values may have motivated these types of errors, as there would be no penalty for missing a macroscopic deprotonation and predicting an extra one. This problem does not seem to be specific to any method category.

We used the eight molecule subset of SAMPL6 compounds with NMR-based dominant microstate sequence information to demonstrate the advantage of evaluating \(\text {p}K_{\text{a}}\) prediction on the level of microstates. Comparison of statistics computed for the 8 molecule dataset by Hungarian matching and microstate-based matching showed how Hungarian matching, despite being the best choice when only numerical matching is possible, can still mask errors in \(\text {p}K_{\text{a}}\) predictions. Errors computed by microstate-based matching were larger compared to numerical matching algorithms in terms of RMSE. Microscopic \(\text {p}K_{\text{a}}\) analysis with numerical matching algorithms may mask errors due to the higher number of guesses made. Numerical matching based on \(\text {p}K_{\text{a}}\) values also ignores information regarding the relative population of states. Therefore, it can lead to \(\text {p}K_{\text{a}}\)s defined between very low-population (high free energy) microstate pairs being matched to the experimentally observable \(\text {p}K_{\text{a}}\) between microstates of higher populations. In such a case, the predicted \(\text {p}K_{\text{a}}\) value could be correct while the predicted microstates would be wrong. Such mistakes caused by Hungarian matching were observed frequently in SAMPL6 results, and therefore we decided that microstate-based matching of \(\text {p}K_{\text{a}}\) values provides a more realistic picture of method performance.

Some QM and LFER methods made mistakes in predicting the dominant tautomers of the ionization states. Dominant tautomer prediction seemed to be particularly difficult for charged tautomers compared with neutral tautomers. The easiest way to extract the dominant microstate sequence from predictions was to calculate the relative free energy of microstates at any reference pH, determining the lowest free energy state in each ionization state. Errors in dominant microstate predictions were very rare for neutral tautomers, but more frequent in cationic tautomers with + 1 charge of the 8 molecule set. SM14 was the molecule with the lowest dominant microstate prediction accuracy, while dominant microstate predictions for SM15 were perfect for all methods. SM14 and SM15 both possess two experimental \(\text {p}K_{\text{a}}\)s and a benzimidazole scaffold. The difference between them is the distance between the experimental \(\text {p}K_{\text{a}}\) values, which is smaller for SM14. These results make sense from the perspective of relative free energies of microstates. Closer \(\text {p}K_{\text{a}}\) values mean that the free energy difference between different microstates is smaller for SM14, and therefore any error in predicting the relative free energy of tautomers is more likely to cause reordering of relative populations of microstates and impact the accuracy of dominant microstate predictions. It would have been extremely informative to evaluate the tautomeric ratios and relative free energy predictions of microstates; however, the experimental data needed for this approach was not available. Tautomeric ratios could not be measured by the experimental methods available to us. Resolving tautomeric ratios would require extensive NMR measurements, but these measurements can suffer from lower accuracy especially when the free energy difference between tautomers is large.
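The extraction of a dominant microstate sequence described above can be sketched as follows. This is an illustrative sketch, not the challenge analysis code: the function name, the microstate IDs, and the free energy values are hypothetical. It relies on the fact that microstates with the same total charge share the same pH dependence, so their ranking does not depend on the reference pH chosen.

```python
def dominant_microstates(microstates):
    """Pick the lowest free energy microstate for each total charge.

    microstates: iterable of (microstate_id, total_charge, relative_free_energy),
    with all free energies given at the same reference pH and in the same units.
    Returns a dict mapping total charge -> dominant microstate ID.
    """
    best = {}
    for ms_id, charge, free_energy in microstates:
        if charge not in best or free_energy < best[charge][1]:
            best[charge] = (ms_id, free_energy)
    return {charge: ms_id for charge, (ms_id, _) in best.items()}


# Hypothetical predictions for one molecule (kcal/mol at an arbitrary reference pH).
predicted = [
    ("neutral_taut_1", 0, 0.0),
    ("neutral_taut_2", 0, 1.8),
    ("cation_taut_1", 1, -2.1),
    ("cation_taut_2", 1, -2.4),
]
print(dominant_microstates(predicted))
```

In the hypothetical input the two cationic tautomers differ by only 0.3 kcal/mol, so a small error in either value would change the predicted dominant cation, which is the situation described for SM14.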

The overall assessment of the SAMPL6 \(\text {p}K_{\text{a}}\) Challenge captured non-stellar performance for microscopic and macroscopic \(\text {p}K_{\text{a}}\) predictions which can be detrimental to the accuracy of protein–ligand affinity predictions and other pH-dependent physicochemical property predictions such as distribution coefficients, membrane permeability, and solubility. Protein–ligand binding affinity predictions utilize \(\text {p}K_{\text{a}}\) predictions in two ways: determination of the relevant aqueous microstates and quantification of the free energy penalty to reach these states. More accurate microscopic \(\text {p}K_{\text{a}}\) predictions are needed to be able to accurately incorporate multiple protonation state corrections (MPSC) into overall binding affinity calculations.

We simulated the effect of overestimating or underestimating the \(\text {p}K_{\text{a}}\) of a ligand by one unit on overall binding affinity prediction for a ligand where both cation and neutral states contribute to binding affinity. A \(\text {p}K_{\text{a}}\) prediction error of this magnitude (assuming dominant tautomers were predicted correctly) could cause up to 0.9 and 1.2 kcal/mol error in overall binding affinity when the binding affinities of the protonation states differ by 2 or 4 kcal/mol, respectively. For the case of a 4 kcal/mol binding affinity difference between protonation states, the pH–\(\text {p}K_{\text{a}}\) range over which the error would be larger than 0.5 kcal/mol surprisingly spans around 3.5 pH units. The worst case, of course, is where there is a significant difference in binding free energy between the two protonation states, but we include the wrong one in our free energy calculation. We demonstrated that the range of pH–\(\text {p}K_{\text{a}}\) values over which the MPSC needs to be incorporated in binding affinity predictions can be wider than the widely assumed range of 2 pH units, depending on the affinity difference between protonation states. At the level of 1 unit \(\text {p}K_{\text{a}}\) error, incorporating the MPSC would improve binding affinity predictions more often than not. If the microscopic \(\text {p}K_{\text{a}}\) could be predicted with 0.5 \(\text {p}K_{\text{a}}\) units of accuracy, MPSC calculations would be much more reliable.

There are multiple factors to consider when deciding which \(\text {p}K_{\text{a}}\) prediction method to utilize. These factors include the accuracy of microscopic and macroscopic \(\text {p}K_{\text{a}}\) values, the accuracy of the number and the identity of ionization states predicted within the experimental pH interval, the accuracy of microstates predicted within the experimental pH interval, the accuracy of tautomeric ratios (i.e., relative free energy between microstates), how costly the calculation is in terms of time and resources, and whether one has access to software licenses that might be required.

All of the top-performing empirical methods were developed as commercial software that requires a license to run, and there were not any open-source alternatives for empirical \(\text {p}K_{\text{a}}\) predictions. Since the completion of the blind challenge, two publications have reported open-source machine learning-based \(\text {p}K_{\text{a}}\) prediction methods; however, the first can only predict the most acidic or most basic macroscopic \(\text {p}K_{\text{a}}\) values of a molecule [55] and the second is only trained for predicting \(\text {p}K_{\text{a}}\) values of monoprotic molecules [56]. Recently, a \(\text {p}K_{\text{a}}\) prediction methodology was published that describes a mixed approach of semi-empirical QM calculations and machine learning that can predict macroscopic \(\text {p}K_{\text{a}}\)s of both mono- and polyprotic species [57]. The authors reported an RMSE of 0.85 for the retrospective analysis performed on the SAMPL6 dataset.

Suggestions for future blind challenge design and evaluation of \(\text {p}K_{\text{a}}\) predictions

This analysis helped us understand the current state of the field and led to many lessons informing future SAMPL challenges. We believe the greatest benefit can be achieved if further iterations of small molecule \(\text {p}K_{\text{a}}\) prediction challenges can be organized, creating motivation for improving protonation state prediction methods for drug-like molecules. In future challenges, it is desirable to increase chemical diversity to cover more common scaffolds [58] and functional groups [59] seen in drug-like molecules, gradually increasing the complexity of molecules.

Microscopic \(\text {p}K_{\text{a}}\) measurements are needed for careful benchmarking of \(\text {p}K_{\text{a}}\) predictions for multiprotic molecules.

Future challenges should promote stringent evaluation for \(\text {p}K_{\text{a}}\) prediction methods from the perspective of microscopic \(\text {p}K_{\text{a}}\) and microstate predictions. It is necessary to assess the capability of \(\text {p}K_{\text{a}}\) prediction methods to capture the free energy profile of microstates of multiprotic molecules. This is critical because \(\text {p}K_{\text{a}}\) predictions are often utilized to determine relevant protonation states and tautomers of small molecules that must be captured in other physical modeling approaches, such as protein–ligand binding affinity or distribution coefficient predictions. Different tautomers can have different binding affinities and partition coefficients.

In this paper, we demonstrated how experimental microstate information can guide the analysis further than the typical \(\text {p}K_{\text{a}}\) evaluation approach that has been used so far. The traditional \(\text {p}K_{\text{a}}\) evaluation approach focuses solely on the numerical error of the \(\text {p}K_{\text{a}}\) values and neglects the difference between macroscopic and microscopic \(\text {p}K_{\text{a}}\) definitions. This is mainly caused by the lack of \(\text {p}K_{\text{a}}\) datasets with microscopic detail. To improve \(\text {p}K_{\text{a}}\) and protonation state predictions for multiprotic molecules, it is necessary to embrace the difference between macroscopic and microscopic \(\text {p}K_{\text{a}}\) definitions and select strategies for experimental data collection and prediction evaluation accordingly. In the SAMPL6 Challenge, the analysis was limited by the availability of experimental microscopic data as well. As is usually the case, macroscopic \(\text {p}K_{\text{a}}\) values were abundant (24 molecules) and limited data on microscopic states was available (8 molecules), although the latter opened new avenues for evaluation. For future blind challenges for multiprotic compounds, striving to collect experimental datasets with microscopic \(\text {p}K_{\text{a}}\)s would be very beneficial, despite the high cost of these measurements. Benchmark datasets of microscopic \(\text {p}K_{\text{a}}\) values with assigned microstates are currently missing because experimental determination of these is much more expensive and time-consuming than macroscopic \(\text {p}K_{\text{a}}\) measurements. This limits the ability to improve \(\text {p}K_{\text{a}}\) and tautomer prediction methods for multiprotic molecules.
If the collection of experimental microscopic \(\text {p}K_{\text{a}}\)s is not possible due to time and resource costs of such NMR experiments, at least supplementing the more automated macroscopic \(\text {p}K_{\text{a}}\) measurements with NMR-based determination of the dominant microstate sequence or tautomeric ratios of each ionization state can create very useful benchmark datasets. This supplementary information can allow microstate-based assignment of experimental to predicted \(\text {p}K_{\text{a}}\) values and a more realistic assessment of method performance.

Evaluation strategy for \(\text {p}K_{\text{a}}\) predictions must be determined based on the nature of experimental \(\text {p}K_{\text{a}}\) measurements available.

If the only available experimental data is in the form of macroscopic \(\text {p}K_{\text{a}}\) values, the best way to evaluate computational predictions is by calculating predicted macroscopic \(\text {p}K_{\text{a}}\) from microscopic \(\text {p}K_{\text{a}}\) predictions. With the conversion of microscopic \(\text {p}K_{\text{a}}\) to macroscopic \(\text {p}K_{\text{a}}\)s, all structural information about the titration site is lost, and the only remaining information is the total charge of macroscopic ionization states. Unfortunately, most macroscopic \(\text {p}K_{\text{a}}\) measurements—including potentiometric and spectrophotometric methods—do not capture the absolute charge of the macrostates. The spectrophotometric method does not measure charge at all. The potentiometric method can only capture the relative charge changes between macrostates. Only pH-dependent solubility-based \(\text {p}K_{\text{a}}\) estimations can differentiate neutral and charged states from one another. It is, therefore, very common to have experimental datasets of macroscopic \(\text {p}K_{\text{a}}\) without any charge or protonation position information regarding the macrostates. This causes an issue of assigning predicted and experimental \(\text {p}K_{\text{a}}\) values before any error statistics can be calculated.

As delineated by Fraczkiewicz [23], the fairest and most reasonable solution for the \(\text {p}K_{\text{a}}\) matching problem involves an assignment algorithm that preserves the order of predicted and experimental microstates and uses the principle of smallest differences to pair values. We recommend Hungarian matching with a squared-error penalty function. The algorithm is available in the SciPy package (scipy.optimize.linear_sum_assignment) [35]. In addition to the analysis of numerical error statistics following Hungarian matching, at the very least, the number of missing and extra \(\text {p}K_{\text{a}}\) predictions must be reported based on unmatched \(\text {p}K_{\text{a}}\) values. Missing or extra \(\text {p}K_{\text{a}}\) predictions point to a problem with capturing the right number of ionization states within the pH interval of the experimental measurements. We have demonstrated that for microscopic \(\text {p}K_{\text{a}}\) predictions, performance analysis based on Hungarian matching gives overly optimistic and misleading results; when microstate data is available, the microstate-based matching we employed provides a more realistic assessment.
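The recommended matching can be sketched with the SciPy function named above. This is a minimal illustration rather than the SAMPL6 analysis code: the helper name `match_pkas` and the example values are hypothetical, and, as a simplification, every unmatched value is counted as missing or extra without restricting the count to the experimental pH interval.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_pkas(experimental, predicted):
    """Pair experimental and predicted pKa values by Hungarian matching.

    The assignment cost is the squared difference between values.
    Returns (pairs, n_missing, n_extra), where pairs is a list of
    (experimental, predicted) tuples.
    """
    exp = np.asarray(experimental, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    cost = (exp[:, None] - pred[None, :]) ** 2      # shape (n_exp, n_pred)
    rows, cols = linear_sum_assignment(cost)        # handles rectangular costs
    pairs = [(float(exp[i]), float(pred[j])) for i, j in zip(rows, cols)]
    n_missing = len(exp) - len(pairs)   # experimental pKas left without a partner
    n_extra = len(pred) - len(pairs)    # predicted pKas left without a partner
    return pairs, n_missing, n_extra


# Hypothetical molecule: two measured pKas, three predicted pKas.
pairs, n_missing, n_extra = match_pkas([4.2, 9.1], [3.5, 6.0, 9.8])
rmse = float(np.sqrt(np.mean([(e - p) ** 2 for e, p in pairs])))
print(pairs, n_missing, n_extra, round(rmse, 2))
```

Reporting `n_missing` and `n_extra` alongside the RMSE keeps the unmatched predictions visible, since the RMSE alone is computed only over the matched pairs.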

Lessons from the first \(\text {p}K_{\text{a}}\) blind challenge will guide future decisions on challenge rules, prediction reporting formats, and challenge inputs.

We solicited three different submission types in SAMPL6 to capture all the necessary information related to \(\text {p}K_{\text{a}}\) predictions. These were (1) macroscopic \(\text {p}K_{\text{a}}\) values, (2) microscopic \(\text {p}K_{\text{a}}\) values and microstate pair identities, and (3) fractional population of microstates with respect to pH. We realized later that collecting fractional populations of microstates was redundant since microscopic \(\text {p}K_{\text{a}}\) values and microstate pairs capture all the necessary information to construct fractional population vs. pH curves [26]. Only microscopic and macroscopic \(\text {p}K_{\text{a}}\) values were used for the challenge analysis presented in this paper.

While exploring ways to evaluate SAMPL6 \(\text {p}K_{\text{a}}\) Challenge results, we developed a better way to capture microscopic \(\text {p}K_{\text{a}}\) predictions, as presented in [26]. This alternative reporting format consists of reporting the charge and relative free energy of microstates with respect to an arbitrary reference microstate and pH. This approach presents the most concise method of capturing all necessary information regarding microscopic \(\text {p}K_{\text{a}}\) predictions and allows calculation of predicted microscopic \(\text {p}K_{\text{a}}\)s, microstate population with respect to pH, macroscopic \(\text {p}K_{\text{a}}\) values, macroscopic population with respect to pH, and tautomer ratios. Still, there may be methods developed to predict macroscopic \(\text {p}K_{\text{a}}\)s directly instead of computing them from microstate predictions, which justifies allowing a macroscopic \(\text {p}K_{\text{a}}\) reporting format. In future challenges, we recommend collecting \(\text {p}K_{\text{a}}\) predictions with two submission types: (1) macroscopic \(\text {p}K_{\text{a}}\) values together with the charges of the macrostates and (2) microstates, their total charge, and relative free energies with respect to a specified reference microstate and pH. This approach is being used in SAMPL7.
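To illustrate why charges and relative free energies are sufficient, the sketch below derives microstate populations and macroscopic \(\text {p}K_{\text{a}}\) values from such a report. It is a sketch under stated assumptions, not the procedure of [26]: it assumes that microstates of a molecule differing by one unit of charge differ by one proton, that each additional proton shifts the free energy by \(RT \ln 10\) per pH unit, and that free energies are in kcal/mol at 25\(^{\circ }\)C. Function names, microstate IDs, and values are hypothetical.

```python
import math
from collections import defaultdict

RT = 1.987204e-3 * 298.15  # kcal/mol at 25 °C
LN10 = math.log(10.0)


def populations(microstates, ph, ph_ref):
    """Fractional microstate populations at a given pH.

    microstates: dict of id -> (total charge, relative free energy at ph_ref).
    """
    weights = {
        ms: math.exp(-(g / RT + q * LN10 * (ph - ph_ref)))
        for ms, (q, g) in microstates.items()
    }
    total = sum(weights.values())
    return {ms: w / total for ms, w in weights.items()}


def macroscopic_pkas(microstates, ph_ref):
    """Macroscopic pKa for each pair of adjacent charge states (q + 1 -> q)."""
    z = defaultdict(float)  # Boltzmann sum per charge state at ph_ref
    for q, g in microstates.values():
        z[q] += math.exp(-g / RT)
    charges = sorted(z)
    return {
        (q + 1, q): ph_ref + math.log10(z[q + 1] / z[q])
        for q in charges
        if (q + 1) in z
    }


# Hypothetical report: two neutral tautomers and one cation, reference pH 7.
report = {
    "neutral_taut_1": (0, 0.0),
    "neutral_taut_2": (0, 1.0),
    "cation_1": (1, -2.0),
}
pkas = macroscopic_pkas(report, ph_ref=7.0)
print(pkas)
print(populations(report, ph=pkas[(1, 0)], ph_ref=7.0))
```

At the computed macroscopic \(\text {p}K_{\text{a}}\), the summed population of the neutral tautomers equals that of the cation, which serves as a consistency check between the two functions.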

In SAMPL6, we provided an enumerated list of microstates and their assigned microstate IDs because we were worried about parsing submitted microstates in SMILES from different sources correctly. There were two disadvantages to this approach. First, this list of enumerated microstates was used as input by some participants which was not our intention. (Challenge instructions requested that predictions should not rely on these microstate lists and only use them for matching microstate IDs.) Second, the first iteration of enumerated microstates was not complete. We had to add new microstates and assign them microstate IDs for a couple of rounds until reaching a complete list. In future challenges, a better way of handling the problem of capturing predicted microstates would be asking participants to specify the predicted protonation states themselves and assigning identifiers after the challenge deadline to aid comparative analysis. This would prevent the partial unblinding of protonation states and allow the assessment of whether methods can predict all the relevant states independently, without relying on a provided list of microstates. Predicted states can be submitted as mol2 files that represent the microstate with explicit hydrogens. The organizers must only provide the microstate that was selected as the reference state for the relative microstate free energy calculations.

In the SAMPL6 \(\text {p}K_{\text{a}}\) Challenge, there was not a requirement that participants should report predictions for all compounds. Some participants reported predictions for only a subset of compounds, which may have led these methods to look more accurate than others due to missing predictions. In the future, it will be better to allow submissions of only complete sets for a better comparison of method performance.

A wide range of methods participated in the SAMPL6 \(\text {p}K_{\text{a}}\) Challenge—from very fast QSPR methods to QM methods with a high-level of theory and extensive exploration of conformational ensembles. In the future, it would be interesting to capture computing costs in terms of average compute hours per molecule. This can provide guidance to future users of \(\text {p}K_{\text{a}}\) prediction methods for selection of which method to use.

Some molecules suffered from less accurate \(\text {p}K_{\text{a}}\) predictions than others in SAMPL6. To understand the reason for these failures better, it can be helpful to ask participants who submit empirical prediction methods to inspect their training sets for the presence of similar compounds and optionally report it.

It is advantageous to field associated challenges with a common set of molecules for different physicochemical properties.

Future blind challenges can maximize learning opportunities by evaluating predictions of different physicochemical properties for the same molecules in consecutive challenges. In SAMPL6, we organized both \(\text {p}K_{\text{a}}\) and log P challenges. Unfortunately, only a subset of compounds in the \(\text {p}K_{\text{a}}\) datasets were suitable for the potentiometric log P measurements [8]. Still, comparing prediction performance of common compounds in both challenges can lead to beneficial insights, especially for physical modeling techniques, if there are common aspects that are beneficial or detrimental to prediction performance. For example, in the SAMPL6 \(\text {p}K_{\text{a}}\) and log P Challenges, COSMO-RS and EC-RISM solvation models achieved good performance. Having access to a variety of physicochemical property measurements can also help the identification of error sources. For example, dominant microstates determined for the \(\text {p}K_{\text{a}}\) challenge can provide information to check if correct tautomers are modeled in a log P or log D challenge. \(\text {p}K_{\text{a}}\) prediction is a requirement for log D prediction, and experimental \(\text {p}K_{\text{a}}\) values can help diagnose the source of errors in log D predictions. The physical challenges in SAMPL7, for which the blind portion of the challenges has just concluded on October 8th, 2020, follow this principle and include \(\text {p}K_{\text{a}}\), log P, and membrane permeability properties for a set of monoprotic compounds. We hope that future \(\text {p}K_{\text{a}}\) challenges can focus on multiprotic drug-like compounds with microscopic \(\text {p}K_{\text{a}}\) measurements for an in-depth analysis.

Conclusion

The first SAMPL6 \(\text {p}K_{\text{a}}\) Challenge focused on molecules resembling fragments of kinase inhibitors, and was intended to assess the performance of \(\text {p}K_{\text{a}}\) predictions for drug-like molecules. With wide participation, we had an opportunity to prospectively evaluate \(\text {p}K_{\text{a}}\) predictions spanning various empirical and QM-based approaches. In addition to community participants, a small number of popular \(\text {p}K_{\text{a}}\) prediction methods that were missing from blind submissions were added as reference calculations after the challenge deadline.

Practical experimental limitations restricted the overall size and microscopic information available for the blind challenge dataset [8]. The experimental dataset consisted of spectrophotometric measurements of 24 molecules, some of which were multiprotic. For a subset of molecules, NMR data were also available to inform the dominant microstate sequence, though microscopic \(\text {p}K_{\text{a}}\) measurements were not performed. We conducted a comparative analysis of methods represented in the blind challenge in terms of both macroscopic and microscopic \(\text {p}K_{\text{a}}\) prediction performance, avoiding any assumptions about the interpretation of experimental \(\text {p}K_{\text{a}}\)s.

Here, we used Hungarian matching to pair predicted and experimental values for the calculation of accuracy and correlation statistics, because the majority of the experimental data consisted of macroscopic \(\text {p}K_{\text{a}}\) values. In addition to evaluating the error in predicted \(\text {p}K_{\text{a}}\) values, we also reported the macroscopic \(\text {p}K_{\text{a}}\) errors that were not captured by the match between experimental and predicted \(\text {p}K_{\text{a}}\) values. These were extra or missing \(\text {p}K_{\text{a}}\) predictions, which are important indicators that predictions are failing to capture the correct ionization states.
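The matching step described above can be sketched as follows. This is a minimal illustration rather than the challenge's analysis code: it uses `scipy.optimize.linear_sum_assignment`, takes the squared error as the pairwise cost (an assumption of this sketch), and the function name `hungarian_match` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def hungarian_match(experimental, predicted):
    """Pair experimental and predicted pKa values for one molecule.

    Returns the matched (experimental, predicted) pairs together with
    the experimental and predicted values that were left unmatched.
    """
    exp = np.asarray(experimental, dtype=float)
    pred = np.asarray(predicted, dtype=float)

    # Pairwise cost matrix; squared error is an assumption of this sketch.
    cost = (exp[:, None] - pred[None, :]) ** 2

    # Optimal one-to-one assignment; rectangular matrices are allowed,
    # in which case the surplus rows or columns stay unassigned.
    rows, cols = linear_sum_assignment(cost)

    matched = [(float(exp[r]), float(pred[c])) for r, c in zip(rows, cols)]
    missing = [float(exp[i]) for i in range(len(exp)) if i not in set(rows)]
    extra = [float(pred[j]) for j in range(len(pred)) if j not in set(cols)]
    return matched, missing, extra


if __name__ == "__main__":
    # Hypothetical molecule: two experimental and three predicted pKa values
    matched, missing, extra = hungarian_match([3.0, 9.0], [8.5, 2.0, 12.0])
    print("matched pairs:", matched)
    print("missing predictions:", missing)
    print("extra predictions:", extra)
```

Values left in `missing` or `extra` correspond to the unmatched \(\text {p}K_{\text{a}}\)s discussed above, which are not reflected in error statistics computed over the matched pairs alone.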

We evaluated microscopic \(\text {p}K_{\text{a}}\) predictions utilizing the experimental dominant microstate sequence data of eight molecules. These experimental data allowed us to use microstate-based matching for evaluating the accuracy of microscopic \(\text {p}K_{\text{a}}\) values in a more realistic way. We determined that QM and LFER predictions had lower accuracy in determining the dominant tautomer of the charged microstates than of the neutral states. For both macroscopic and microscopic \(\text {p}K_{\text{a}}\) predictions, we identified methods that performed consistently well according to multiple statistical metrics. Focusing on the comparison of molecules instead of methods for macroscopic \(\text {p}K_{\text{a}}\) prediction accuracy indicated that molecules with sulfur-containing heterocycles, iodo, and bromo groups suffered from lower \(\text {p}K_{\text{a}}\) prediction accuracy.

The overall performance of \(\text {p}K_{\text{a}}\) predictions as captured in this challenge is concerning for the application of \(\text {p}K_{\text{a}}\) prediction methods in computer-aided drug design. Many computational methods for predicting target affinities and physicochemical properties rely on \(\text {p}K_{\text{a}}\) predictions for determining relevant protonation states and the free energy penalty of such states. An error of 1 \(\text {p}K_{\text{a}}\) unit is an optimistic estimate of the accuracy of current macroscopic \(\text {p}K_{\text{a}}\) predictions for drug-like molecules based on the SAMPL6 Challenge, where errors in predicting the correct number of ionization states or determining the correct dominant microstate were also common to many methods. In the absence of other sources of error, we showed that a 1 unit over- or underestimation of the \(\text {p}K_{\text{a}}\) of a ligand can cause significant errors in the overall binding affinity calculation due to errors in the multiple protonation state correction factor.
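The size of this effect can be illustrated with a deliberately simplified sketch; it is not the correction factor calculation used in the analysis above. It assumes a monoprotic base that binds only in its neutral form, so the free energy penalty for populating that state in solution is \(-RT \ln f_{\text{neutral}}\). The function names, the example \(\text {p}K_{\text{a}}\) values, the pH of 7.4, and the temperature of 298.15 K are all assumptions made for illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def neutral_fraction_base(pka, ph):
    """Fraction of a monoprotic base in its neutral (deprotonated) form."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def state_penalty(pka, ph, temperature=298.15):
    """Free energy cost (kcal/mol) of selecting the neutral state in solution."""
    return -R_KCAL * temperature * math.log(neutral_fraction_base(pka, ph))


if __name__ == "__main__":
    ph = 7.4
    # Hypothetical "true" pKa and a prediction that is 1 unit too high
    true_pka, predicted_pka = 9.4, 10.4
    error = state_penalty(predicted_pka, ph) - state_penalty(true_pka, ph)
    print(f"penalty with true pKa:      {state_penalty(true_pka, ph):.2f} kcal/mol")
    print(f"penalty with predicted pKa: {state_penalty(predicted_pka, ph):.2f} kcal/mol")
    print(f"error from 1 pKa unit:      {error:.2f} kcal/mol")
```

Under these assumptions, when the binding state is a minor population in solution, each unit of \(\text {p}K_{\text{a}}\) error shifts the penalty by close to \(RT \ln 10\), or roughly 1.4 kcal/mol at room temperature.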

The SAMPL6 GitHub Repository contains all information regarding the challenge structure, experimental data, blind prediction submission sets, and evaluation of methods. The repository will be useful for future follow-up analyses, and the experimental measurements can continue to serve as a benchmark dataset for testing methods.

In this article, we aimed not only to present a comparative analysis of the \(\text {p}K_{\text{a}}\) prediction performance of contemporary methods for drug-like molecules, but also to propose a stringent \(\text {p}K_{\text{a}}\) prediction evaluation strategy that takes into account differences in microscopic and macroscopic \(\text {p}K_{\text{a}}\) definitions. We hope that this study will guide and motivate further improvement of \(\text {p}K_{\text{a}}\) prediction methods.