Introduction

The Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL) is a series of blind challenges aimed at improving the accuracy of computational models that predict physical properties relevant to modern rational drug design. The data assessed during the course of the SAMPL challenge series include free energies of binding, hydration free energies, protonation equilibria, and partition and distribution equilibria. Among the latter, the distribution/partition coefficients of active pharmaceutical ingredients (APIs) are of particular interest, as they provide guidance on the rational solvent selection for purification [1, 2] and can also be used to estimate the API’s distribution between the compartments of the human body [3]. Other areas where distribution coefficients are of interest are the modeling of environmental properties like bioaccumulation [4], plastics recycling [5,6,7], and other types of technical extraction processes [8,9,10]. The assessment of prediction methods for practical application provides important information for users who want to apply prediction tools, and it also makes an important contribution to the development and evaluation of physical property prediction methods. Blind prediction challenges like SAMPL provide the rare opportunity to benchmark a model under real conditions and ensure a fair assessment of the different methods.

Since the COSMO-RS method was originally developed for the prediction of partition coefficients and Henry’s law constants [11], there is a long history of COSMO-RS contributions to the SAMPL blind prediction challenge series. The results underline the good quality of COSMO-RS derived molecular free energies in solution [12,13,14,15].

The recent SAMPL8 challenge contained two categories: the prediction of host–guest binding affinities and a physical property challenge, focusing on the prediction of distribution coefficients (logD) and acid dissociation constants (pKa) of a series of drug molecules. The experimental data collection of the latter part was provided by GlaxoSmithKline [1] and contains pKa data for twenty-three drug molecules, as well as logD values for eleven of those compounds. The distribution data were determined for seven bi-phasic systems, ranging from nonpolar/polar combinations like heptane/water to more hydrophilic organic phases like methyl ethyl ketone or octanol, where the water solubility in the organic phase has to be taken into account. The SAMPL8 compound set is diverse (see Fig. 1), with only two groups of molecules that share the same scaffold with varying substituents. These are three 2-aminobenzimidazole derivatives (SM8-7, SM8-9, SM8-17) and two 2-chloroquinazolin-4-amines (SM8-15, SM8-18).

Here we report the results of the COSMO-RS method for the blind prediction of the SAMPL8 acid dissociation constants and distribution coefficients.

Fig. 1

Molecules of the SAMPL8 logD and pKa challenge

Computational methods

The pKa and logD predictions were done with the Conductor-like Screening Model for Realistic Solvation (COSMO-RS) [16]. COSMO-RS is a statistical thermodynamics model that describes intermolecular interactions from an ensemble of pair-wise interacting surface segments. The surface interaction terms are based on the screening charge density of the Conductor-like Screening Model (COSMO) [17], which is a continuum solvation model commonly used in computational quantum chemistry. Thus, we have a two-step procedure that starts with the COSMO calculation, followed by the COSMO-RS prediction. The time-consuming quantum chemical COSMO calculations need to be done only once per compound. Technically, the results of the COSMO calculations are saved in so-called COSMO files, which can be stored in a database. For a detailed description of the COSMO-RS method, please refer to [18] and [19].

We started our study with the generation of the (de)protonated states and tautomeric states of the SAMPL8 molecules and their conformers as described in reference 15. The only difference from the process described in reference 15 is that, in addition to the tautomers, the (de)protonated states were also generated. In this procedure, we used version 21 of the COSMOconf [20] and COSMOquick [21, 22] software together with version 7.5 of the quantum chemistry software TURBOMOLE [23]. As a result, we obtained a set of COSMO files for the conformers of each microstate, which can be used for the subsequent COSMO-RS calculations. For the logD predictions, the conformers of the neutral microstates were merged to form the conformer set of the compound. The number of microstates and conformers is given in Table S4 of the Supplementary Material. The COSMO files of the solvents used in this study were taken from the COSMObase 21 database [24]. All COSMO-RS calculations were done with the COSMOtherm program version 21 using the BP_TZVPD_FINE_21 parametrization [25].

pKa

The aqueous dissociation constant of an acid (pKa), or of the conjugate acid of a base (base pKa), can be estimated from a linear free energy relationship (LFER) that connects the COSMO-RS predicted free energy \({\Delta }{G}_{i}\) to the dissociation constant pKa by two adjustable parameters, slope \({c}_{1}\) and shift \({c}_{0}\) [26, 27].

$$p{K}_{a}^{i}={c}_{1}{\Delta }{G}_{i}+{c}_{0}$$
(1)

For the pKa of an acid, \({\Delta }{G}_{i}\) is the free energy difference between the anion and the neutral state. In case of the base pKa, the free energy difference between the neutral state and the cation is used. The LFER parameters are specific to the solvent and the reacting system, i.e. an optimal correlation (and thus best prediction quality) is reached for two independent sets of LFER parameters, one for the (acid) pKa and one for the base pKa.
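
As an illustration of Eq. 1, the following minimal sketch fits the slope \({c}_{1}\) and shift \({c}_{0}\) by ordinary least squares and then applies them to a new free energy difference. The function names and all numerical values are hypothetical placeholders; they are not the LFER parameters shipped with COSMOtherm, and the actual fitting procedure of [26, 27] may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LferParams:
    """Slope c1 and shift c0 of Eq. 1 (hypothetical container)."""
    c1: float
    c0: float


def fit_lfer(delta_gs: Sequence[float], pkas: Sequence[float]) -> LferParams:
    """Least-squares fit of pKa = c1 * dG + c0 to reference data."""
    n = len(delta_gs)
    mean_x = sum(delta_gs) / n
    mean_y = sum(pkas) / n
    sxx = sum((x - mean_x) ** 2 for x in delta_gs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(delta_gs, pkas))
    c1 = sxy / sxx
    return LferParams(c1=c1, c0=mean_y - c1 * mean_x)


def lfer_pka(delta_g: float, params: LferParams) -> float:
    """Eq. 1: predicted pKa for one (de)protonation free energy difference."""
    return params.c1 * delta_g + params.c0


if __name__ == "__main__":
    # Made-up training data; one such fit would be done for acids, one for bases.
    acid_fit = fit_lfer([210.0, 220.0, 230.0], [5.0, 10.0, 15.0])
    print(acid_fit, lfer_pka(225.0, acid_fit))
```

Keeping the acid and base parameter sets as separate `LferParams` instances mirrors the statement above that the best correlation is obtained with two independent fits.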

As in the SAMPL7 [28] challenge, the standard state free energy approach of Gunner et al. [29] was to be used to describe the (de)protonation equilibria of the compounds under consideration. These relative free energies can be calculated directly from the LFER given in Eq. 1. The participants were asked to submit the relative free energies of all microstates (neutral and ionic) with respect to a given reference state. In the evaluation, this information was used to calculate titration curves and the macroscopic pKa values as described in reference [29]. The assignment to the experimental data was done with the help of the popular transition, i.e. the protonation transition that was found by the majority of the submissions [30].

In order to determine a consistent set of standard state free energies for the entire system, from the protonated cations to the anions, we used a new combined LFER model fit that covers both (acid) pKa and base pKa data. This model was called “one pKa fit” in the submission, and will be referred to as the “unified” pKa model in the following. Besides this, we prepared two additional submissions, one with the (acid) pKa LFER fit [26], and one with the base pKa LFER fit [27], both of which are available in COSMOtherm. In addition to the relative free energies, we also submitted the optional macroscopic pKa data.

logD

The distribution coefficient, commonly used in the logarithmic form logD, describes the distribution of a compound between the two liquid phases of a bi-phasic system. It is defined as the ratio of the sum of the concentrations of all forms of the compound in the two phases. In contrast to the partition coefficient logP, which considers only the neutral solute, the logD considers the sum of dissociated and non-dissociated species.

$$log{D}^{\left(\text{2,1}\right)}={{log}}_{10}\left(\frac{[neutral+ionic\ in\ phase\ 2]}{[neutral+ionic\ in\ phase\ 1]}\right)$$
(2)

In this study, we examined six bi-phasic systems that consist of an organic phase and a water phase, and one system with two organic phases (cyclohexane/DMF).

Using the assumption that the dissociated species will not migrate into the organic solvent phase, we can calculate the fraction of the dissociated solute in the aqueous phase from the dissociation constant pKa and the pH. As a result, we obtain the logD from the logP and a dissociation correction for monoprotic acids or bases.

$${log}{D}^{(org.,water)}={log}{P}^{(org.,water)}-{{log}}_{10}\left(1+{10}^{{\varDelta }_{acid/base}}\right)$$
(3)
$${\varDelta }_{acid}=pH-pK_a\left(acid\right)$$
$${\varDelta }_{base}=pK_a\left(base\right)-pH$$

In the case of the cyclohexane/DMF system we did not consider dissociation, and the logD values correspond to the logP values of the neutral solutes. For the dissociation corrections, the experimental pKa values provided by the SAMPL8 organizers were used (see Table 1). A pH of 8 was used for SM8-1,3,5,6 and a pH of 3 for the rest.
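
The dissociation correction of Eq. 3 can be sketched in a few lines. The function names are illustrative and the logP value in the example is a made-up placeholder; the pKa of 3.99 and the pH of 8 are the values discussed for SM8-1 in this paper.

```python
import math


def dissociation_correction(pka: float, ph: float, is_acid: bool) -> float:
    """Return log10(1 + 10**Delta) from Eq. 3 for a monoprotic acid or base."""
    # Delta_acid = pH - pKa(acid); Delta_base = pKa(base) - pH
    delta = (ph - pka) if is_acid else (pka - ph)
    return math.log10(1.0 + 10.0 ** delta)


def log_d(log_p: float, pka: float, ph: float, is_acid: bool) -> float:
    """Eq. 3: logD of an organic/water system from the neutral-species logP."""
    return log_p - dissociation_correction(pka, ph, is_acid)


if __name__ == "__main__":
    # Acid with pKa 3.99 at pH 8; logP = 2.0 is a hypothetical value.
    print(round(dissociation_correction(3.99, 8.0, is_acid=True), 2))
    print(round(log_d(2.0, 3.99, 8.0, is_acid=True), 2))
```

For a strongly ionized solute the correction approaches \(\left|pH-pK_a\right|\), so an error in the pKa carries over almost one-to-one into the predicted logD.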

Neglecting the dissociation in organic phases with significant water content should be a valid assumption for common acids and bases [33]. Scott and Clymer estimate a logD error of ~ 0.3 log units for a pH - pKa difference of 3 units and a ratio of the partitioning of the ionic and the neutral form of 0.001 [33]. In some cases of this study the pH - pKa difference exceeds 3 units. However, we do not have reliable data for the partitioning ratio to estimate the error.
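
The magnitude of this estimate can be reproduced with a simple two-species mass balance. This is an assumption made here for illustration and not necessarily the derivation used in [33]: if the ionic form partitions with a ratio \(r={P}_{ion}/{P}_{neutral}\), the distribution coefficient becomes \(D=P\left(1+r\cdot {10}^{\Delta }\right)/\left(1+{10}^{\Delta }\right)\), so Eq. 3 underestimates logD by \({log}_{10}\left(1+r\cdot {10}^{\Delta }\right)\). The function name below is hypothetical.

```python
import math


def neglected_ion_error(delta: float, ion_ratio: float) -> float:
    """logD error (log units) made by Eq. 3 when ion partitioning is ignored.

    delta     : pH - pKa for acids, pKa - pH for bases
    ion_ratio : P_ion / P_neutral (assumed constant)
    """
    return math.log10(1.0 + ion_ratio * 10.0 ** delta)


if __name__ == "__main__":
    # Delta = 3 and a ratio of 0.001, as in the estimate quoted above.
    print(round(neglected_ion_error(3.0, 0.001), 2))
    # The error grows quickly once Delta exceeds 3 units.
    print(round(neglected_ion_error(4.0, 0.001), 2))
```

Under this assumption the error increases by roughly one log unit for every additional unit of \(\Delta\) beyond the point where \(r\cdot {10}^{\Delta }\) exceeds 1.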

Since the goal of the SAMPL8 challenge was to predict the partition coefficient, which is defined for the salt-free phases, we did not consider the influence of counter ions discussed in the literature [31, 32]. The background ions introduced by the aqueous buffers used in the experimental work influence the distribution coefficients, but the effect depends on the ion type [31] and thus contradicts a general definition of the distribution coefficients. Nevertheless, this discrepancy is a source of deviations between calculation and experiment.

Table 1 pKa and pH values used for the logD dissociation correction

The partition coefficient used in Eq. 3 was calculated from the difference of the solute chemical potentials at infinite dilution in the two phases, \({\mu }_{i}^{(water/org.)}\).

$${\text{log}}_{10}\left({P}_{i}^{\left(org.,water\right)}\right)=\left({\mu }_{i}^{\left(water\right)}-{\mu }_{i}^{\left(org.\right)}\right)/RTln\left(10\right)+{\text{log}}_{10}\left(VQ\right)$$
(4)

In Eq. 4 the quotient of the molar volumes of the solvent phases, \(VQ={V}^{\left(water\right)}/{V}^{(org.)}\), is used to convert the partition coefficient from the mole fraction framework to a molar concentration-based definition. If available, experimental densities were utilized to obtain the molar solvent phase volume quotient \(VQ\). Otherwise, COSMOtherm estimates were used. For the heptane/water and cyclohexane/water systems we assumed pure phases. For the remaining systems the mutual solubility of the solvents was taken into account. The phase compositions used are listed in Table 2. All logP values were calculated at 25 °C.
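
Eq. 4 translates directly into code. The sketch below assumes chemical potentials in kJ/mol and molar volumes in any common unit; the function name and the chemical potentials in the example are hypothetical, and the molar volumes are rough values for pure water and pure heptane rather than the phase data of Table 2.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)


def log_p(mu_water: float, mu_org: float,
          v_water: float, v_org: float,
          temperature: float = 298.15) -> float:
    """Eq. 4: log10 P(org., water) from infinite-dilution chemical potentials.

    mu_water, mu_org : solute chemical potentials in kJ/mol
    v_water, v_org   : molar volumes of the two solvent phases (same unit)
    """
    vq = v_water / v_org  # volume quotient VQ
    rt_ln10 = R_KJ * temperature * math.log(10.0)
    return (mu_water - mu_org) / rt_ln10 + math.log10(vq)


if __name__ == "__main__":
    # Hypothetical potentials; approximate molar volumes in cm^3/mol.
    print(round(log_p(mu_water=-20.0, mu_org=-30.0,
                      v_water=18.07, v_org=146.5), 2))
```

At 25 °C a chemical potential difference of about 5.7 kJ/mol corresponds to one log unit, and the volume quotient shifts the result by a constant that depends only on the two solvent phases.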

Table 2 Compositions of the bi-phasic systems used for the logP calculations

Results and discussion

pKa predictions

Among the six ranked SAMPL8 entries, COSMO-RS (using the “unified” pKa LFER) and the Deep Gaussian Process submission provide the best predictions [30]. COSMO-RS showed the better mean absolute deviation (MAD) of 2.49 compared to 2.62 for the Deep Gaussian Process method. In terms of the root mean square deviation (RMSD), the Deep Gaussian Process submission showed a lower deviation of 3.17 compared to 3.44 for COSMO-RS. The SAMPL8 evaluation was done on a data set with 16 pKa values for 14 molecules and used the popular transition for the assignment of the experimental values. This assignment method was used because it provides an objective criterion for the mapping of calculation to experiment and should avoid predictions that are good for the wrong reasons, i.e. when a value is predicted correctly but is assigned to the wrong transition (private communication D. L. Mobley 2022). It is assumed that the transition, neutral to anion or cation to neutral, found by the majority of the submissions is the dominant one for the experiment, which works well whenever the correct transition is popular, and badly when it is not. A summary of the SAMPL8 COSMO-RS results is given in Fig. 2.
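
The two error measures and the majority-vote idea behind the popular transition can be sketched as follows. This is an illustration only, not the SAMPL8 evaluation code; the function names, the transition labels and the numbers in the example are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def mad(pred: Sequence[float], expt: Sequence[float]) -> float:
    """Mean absolute deviation between predictions and experiment."""
    return sum(abs(p - e) for p, e in zip(pred, expt)) / len(pred)


def rmsd(pred: Sequence[float], expt: Sequence[float]) -> float:
    """Root mean square deviation between predictions and experiment."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, expt)) / len(pred))


def popular_transition(submitted: Sequence[str]) -> str:
    """Transition reported most often across submissions.

    Ties are resolved in favor of the label that appears first.
    """
    return Counter(submitted).most_common(1)[0][0]


if __name__ == "__main__":
    predicted = [3.1, 7.9, 10.5]   # made-up pKa values
    measured = [3.0, 8.4, 6.5]
    print(mad(predicted, measured), rmsd(predicted, measured))
    print(popular_transition(["0 -> -1", "+1 -> 0", "0 -> -1"]))
```

Because the RMSD weights large deviations more strongly than the MAD, a few outliers can reverse the ranking of two methods between the two measures, as seen above.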

Fig. 2

COSMO-RS (unified model) results of the SAMPL8 pKa challenge evaluation of 16 pKa values for 14 molecules. Dashed lines mark the corridor of 1 pKa unit. The charge of the ionic state is given in parentheses. a: Original SAMPL8 assignment of experiment and calculation. b: The assignment of experimental and calculated values was derived from the experimental pH-dependent solubilities

The plot of the original assignment of transitions done by SAMPL8 depicted in Fig. 2a shows a number of compounds where the calculations deviate several pKa units from the experiment. Two of them, SM8-1 and SM8-20, have two experimental pKas, which were compared with the same calculated value, i.e. the same transition. For SM8-20 this results in a large deviation, whereas the deviations for SM8-1 are moderate. Since the deviations for SM8-14,17,21,22 were larger than expected, we checked the experimental titration curves (pH dependent solubilities) provided by the organizers [30]. To determine the transitions, we assumed that the neutral organic substances have a lower water solubility than the ionic species and exist in the pH range considered. Thus, they represent the minimum of the solubility curves. Starting from the solubility minimum, we now consider pKas that are at a lower pH value as base pKa (cation to neutral transition) and pKas that are at a higher pH as acid pKas (neutral to anion transition). With this assignment, we obtain the results shown in Fig. 2b. The large deviations have been reduced and the MAD has decreased to 1.47 (RMSD 1.65). Interestingly, the deviation for the base pKa of SM8-1 has increased.
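
The solubility-based assignment rule described above can be written as a short routine. This is a sketch under the stated assumption that the neutral species marks the solubility minimum; the function name, the labels and the data in the example are hypothetical and are not taken from the SAMPL8 solubility curves.

```python
from typing import List, Sequence, Tuple


def assign_transitions(ph_values: Sequence[float],
                       solubilities: Sequence[float],
                       pkas: Sequence[float]) -> List[Tuple[float, str]]:
    """Label each experimental pKa relative to the pH of minimum solubility.

    pKa below the minimum -> base pKa (cation to neutral)
    pKa above the minimum -> acid pKa (neutral to anion)
    A pKa exactly at the minimum is labeled as acid pKa here (arbitrary choice).
    """
    i_min = min(range(len(solubilities)), key=lambda i: solubilities[i])
    ph_min = ph_values[i_min]
    return [
        (pka, "base pKa (cation to neutral)" if pka < ph_min
         else "acid pKa (neutral to anion)")
        for pka in pkas
    ]


if __name__ == "__main__":
    ph = [2.0, 4.0, 6.0, 8.0, 10.0]
    solubility = [5.0, 1.0, 0.1, 2.0, 9.0]  # made-up values, arbitrary units
    for pka, label in assign_transitions(ph, solubility, [3.0, 8.5]):
        print(pka, label)
```

Unlike the popular transition, this rule depends only on the experimental data and not on what the submitted predictions happen to agree on.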

At this point, we were interested in how the individual aqueous (acid) pKa and base pKa models of COSMOtherm perform on the SAMPL8 dataset. These two models were submitted outside the official evaluation as “COSMO-RS base pKa fit” for bases and “COSMO-RS pKa fit” for acids. Depending on the experiment-based transition assignment, the prediction of the (acid) pKa or base pKa model was compared with the experiment. The resulting dataset is bigger than the set used for the original SAMPL8 evaluation described above. The results for the 25 pKa values of 20 molecules are shown in Fig. 3 and Table S1 in the Supplementary Material.

Fig. 3

COSMOtherm pKa predictions of 25 pKa values for 20 molecules. The transitions were derived from the experimental titration curves. The LFER fit for acids was used for the neutral to anion transition (circles) and the base pKa fit for the cation to neutral transition (triangles). The dashed lines mark the corridor of 1 pKa unit deviation

Apart from the SM8-1 cation to neutral transition, the predictions are in good agreement with the experiment. Nevertheless, due to the SM8-1 outlier the RMSD of 1.42 (MAD 0.95) is higher than expected. The RMSDs of the pKa and base pKa fit sets are 0.49 pKa units for the aqueous acid pKa LFER fit [26], and 0.56 pKa units for the base pKa LFER fit [27], respectively.

A careful check of the SM8-1 COSMO files did not reveal any significant anomalies. Two energetically unfavorable isomers that remained in the set do not change the results (see Table S3 of the Supplementary Material). We tried to optimize a zwitterionic tautomer, but could not detect a stable structure. Besides COSMO-RS there are two other SAMPL8 submissions (“EC_RISM” and “RFE-uESE-extra”) that predict two pKa values for the amphoteric SM8-1 compound. Similar to the COSMO-RS prediction, these submissions also show negative pKa values (EC_RISM: −2.5; RFE-uESE-extra: −11.1) for the cation-neutral transition and thus a large deviation from the experimental value of 2.54. The re-measurement of the acid constants of SM8-1, 2, 5, 22 by Gretz, Czodrowski, Tielker and Kast from the TU Dortmund University [30] did not change the overall picture, but for SM8-1 they report a pKa of 3.99 for the neutral-anion transition and no base pKa above 2 pKa units.

If we omit the SM8-1 base pKa outlier, we obtain a good agreement with the experiment with a MAD of 0.75 and an RMSD of 0.89. This result is within the expected range for molecules that are not present in the model’s fit set.

logD prediction

The SAMPL8 blind challenge counted 17 logD submissions [30]. Five of them were submitted to be included in the ranking. This subset consists of four methods classified as physical approaches that include quantum chemical calculations and one that uses a machine learning model built on molecular dynamics simulations. As can be seen from the overall deviations in the Supplementary Material Fig. S1, the COSMO-RS method provided the most accurate predictions in this challenge. The mean absolute deviation (MAD) over all bi-phasic systems and molecules is 1.07 log units and the root mean square deviation (RMSD) for the same data set is 1.36 log units. Looking at the MADs of all submissions and bi-phasic systems, COSMO-RS ranks first for the MEK/water and ethyl acetate/water systems (see Supplementary Material Fig. S2). For cyclohexane/DMF, heptane/water, octanol/water and cyclohexane/water the COSMO-RS submission ranks second, sharing this second place for cyclohexane/water with the “EC_RISM_logDexp” submission. The COSMO-RS predictions for the TBME/water system are better than the ones for cyclohexane/DMF, but the deviations of three other submissions are smaller, which puts the method in fourth place.

The MAD and RMSD deviations for the COSMO-RS logD predictions for all bi-phasic systems are given in Table 3. Figure 4 shows the comparison against the experiment. The calculated and experimental values are listed in the Supplementary Material Table S2.

Table 3 Mean absolute deviation (MAD) and root mean square deviation (RMSD) between COSMO-RS predictions and experiment
Fig. 4

COSMO-RS logD predictions at 25 °C. Acids are marked by circles and bases by triangles. For the nonaqueous solvent system cyclohexane/DMF the pKa correction was not applied

Due to the “first principles based” nature of the COSMO-RS method it is not possible to provide error bars for individual data points. However, the error can be expected to be in the order of 0.5 log units (in the sense of a root mean square deviation) for partition coefficients [14].

The solvent system cyclohexane/DMF is the only non-aqueous system in SAMPL8 and therefore does not require a dissociation correction. The solutes SM8-1,6,16 are chemically diverse and belong to different compound classes. The mutual solubilities of cyclohexane and DMF were predicted by COSMOtherm (see Table 2). The logP of the solutes turned out to be rather insensitive to the cyclohexane content of the DMF phase: varying it between 0 and 10 mol% while keeping a pure cyclohexane phase does not change the results significantly, whereas the DMF content of the nonpolar cyclohexane phase has a large influence (see Fig. 5). This trend can be explained roughly from the phase and solute polarities. The solutes are all dipolar and protic, capable of both accepting and donating hydrogen bonds. Thus, the solutes strongly prefer the polar (DMF-rich) phase, which contains hydrogen bond acceptors that can interact with the solutes. The nonpolar (cyclohexane-rich) phase avoids contact with the polar solutes simply because of electrostatic repulsion. Thus, all solutes show negative partition coefficients in the pure DMF and cyclohexane solvents. Adding cyclohexane to the DMF phase does not change the picture qualitatively, as the nonpolar cyclohexane does not offer any additional hydrogen bonding contacts for the solutes to interact with. This is different for the cyclohexane phase, where adding DMF also adds a significant number of hydrogen bonding sites that the solutes can interact with, making the cyclohexane-rich phase much more attractive for the solutes and thus shifting the apparent partition coefficients towards positive values (see Fig. 5). For the actual partition equilibrium, the self-interaction of the solvents also plays a role, but even this simplified picture makes clear that the logP prediction depends significantly on the composition of the two phases.

Using the COSMOtherm prediction to obtain the composition of the two phases thus provides a reasonable initial guess, but the predicted composition carries an error of similar magnitude to that of the logP prediction itself. This may bias the prediction substantially and would explain the systematic shift seen for this solvent system. Hence, it is recommended to use experimental phase compositions if available, and COSMOtherm predictions only as a fallback.

Fig. 5

Partition coefficient logP for the cyclohexane/DMF system using different solvent phase compositions. LLE denotes the phases given in Table 2

In this study we assumed that protonation and dissociation only take place in the water phase, not in the organic phase. Hence, the dissociation correction of Eq. 3 was applied to the water phase only. The actual pH values of the measurements, which were not known at the time of the challenge, differ slightly from the values used for the predictions [1]. Nevertheless, these small differences do not change the predictions significantly.

Since the neutral forms of the substances under consideration have low water solubilities, their distribution coefficients logD were determined at a pH value where the molecules were significantly ionized, to ensure that the solubility is large enough to be measurable [1]. The resulting dissociation corrections are listed in Table 1. All molecules have relatively large corrections, which are of the same order of magnitude as the calculated logP values. The absolute ratio of the corrections to the logP values is between 0.5 and 3.5. Therefore, the errors of the pKa values play an important role when considering the deviation from the experimental logD values. The dissociation correction of SM8-1, which shows large deviations for four bi-phasic systems, changes from −2.99 to −4.01 if we use the pKa value of 3.99 measured by Gretz, Czodrowski, Tielker and Kast [30]. With this correction, the large deviations of SM8-1 for the organic/water systems are reduced substantially, except for heptane/water, where the prediction already matches the experiment when the SAMPL8-provided pKa values are used (see Fig. 6).

Fig. 6

Experimental logD values and absolute deviations from experiment of SM8-1 in the different organic/water systems in log units. The logP corrections were calculated from the SAMPL8 pKa, or the pKa measured by Gretz, Czodrowski, Tielker and Kast [30]

Typically, the solubility of the organic solvent in the water phase is negligibly small, but the solubility of water in the organic phase can be significant. For the octanol/water system the equilibrium solubility of water was 27 mol%, for the ethyl acetate/water system it was 20.6 mol%, and for MEK/water 35.1 mol%. For the heptane/water and cyclohexane/water systems pure solvent phases were assumed, as the mutual solubility of the nonpolar solvents and water is very small.

A further source of the unusually high deviations between prediction and measurement for some solutes could be that metastable extraction equilibria were measured. As stated in reference 1, due to the high-throughput nature of the experiments not all solvent combinations could be pre-saturated.

Conclusion

We submitted three COSMO-RS based contributions to the SAMPL8 pKa challenge. Since the task was to predict the standard state free energies for the whole protonation equilibrium, our ranked submission was based on a new combined LFER model that covers the acid and the base pKa and can be used for the prediction of the required values (unified model). The macroscopic pKa values were calculated from the standard state free energies and compared with the experimental data. The assignment of the experimental and calculated values was made using the popular transition method. This COSMO-RS approach yields an RMSD of 3.44 log units and is in second place among the ranked submissions. A different assignment of experimental and calculated values, which is based on the experimental pH-dependent solubilities, reduces the RMSD to 1.65. We consider this method to be more consistent, as it is not based on the results of the submitted predictions. Besides the unified model we submitted the results of the COSMOtherm pKa and base pKa models. Together with the experiment-based assignment, these models yield an RMSD of 1.42 log units. This deviation is mainly due to the base pKa of SM8-1. The experimental value of this cation-neutral transition could not be confirmed by the re-measurement of Gretz, Czodrowski, Tielker and Kast, and the omission of this outlier leads to an RMSD of 0.89. The results show that the separate LFER fits for pKa and base pKa of COSMOtherm are clearly advantageous and recommended for the calculation of macroscopic pKa values. However, if the entire (de)protonation equilibrium needs to be described consistently, a unified model is needed.

With an RMSD of 1.36 log units the COSMO-RS logD calculations were the most accurate predictions of the SAMPL8 challenge. Nevertheless, some predictions show unexpectedly large deviations from the experiment. The discussion of these deviations is difficult, as they depend on several factors. An important factor is the composition of the phases of the bi-phasic systems. In particular, the fraction of polar substance in the nonpolar phase, e.g. the DMF fraction of the cyclohexane phase, has a major influence on the distribution coefficients. To avoid the additional error of predicting the phase composition, it is recommended to use the experimental phase composition whenever possible.

Another important point is the dissociation correction for the aqueous phases, which in some cases becomes dominant at the pH values used in SAMPL8. For a more accurate analysis of the deviations of the logD predictions, the experimental error of the pKa values used would have to be known. The same applies to the achievement of phase equilibrium during the high throughput measurements, which may be another source of error.