Blind prediction of cyclohexane–water distribution coefficients from the SAMPL5 challenge

Bannan, Caitlin C.; Burley, Kalistyn H.; Chiu, Michael; Shirts, Michael R.; Gilson, Michael K.; Mobley, David L.

doi:10.1007/s10822-016-9954-8

Blind prediction of cyclohexane–water distribution coefficients from the SAMPL5 challenge

Published: 27 September 2016

Volume 30, pages 927–944, (2016)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Journal of Computer-Aided Molecular Design Aims and scope Submit manuscript

Blind prediction of cyclohexane–water distribution coefficients from the SAMPL5 challenge

Download PDF

Caitlin C. Bannan¹,
Kalistyn H. Burley²,
Michael Chiu³,
Michael R. Shirts⁴,
Michael K. Gilson⁵ &
…
David L. Mobley ORCID: orcid.org/0000-0002-1083-5533^1,2

1755 Accesses
86 Citations
13 Altmetric
5 Mentions
Explore all metrics

Abstract

In the recent SAMPL5 challenge, participants submitted predictions for cyclohexane/water distribution coefficients for a set of 53 small molecules. Distribution coefficients (log D) replace the hydration free energies that were a central part of the past five SAMPL challenges. A wide variety of computational methods were represented by the 76 submissions from 18 participating groups. Here, we analyze submissions by a variety of error metrics and provide details for a number of reference calculations we performed. As in the SAMPL4 challenge, we assessed the ability of participants to evaluate not just their statistical uncertainty, but their model uncertainty—how well they can predict the magnitude of their model or force field error for specific predictions. Unfortunately, this remains an area where prediction and analysis need improvement. In SAMPL4 the top performing submissions achieved a root-mean-squared error (RMSE) around 1.5 kcal/mol. If we anticipate accuracy in log D predictions to be similar to the hydration free energy predictions in SAMPL4, the expected error here would be around 1.54 log units. Only a few submissions had an RMSE below 2.5 log units in their predicted log D values. However, distribution coefficients introduced complexities not present in past SAMPL challenges, including tautomer enumeration, that are likely to be important in predicting biomolecular properties of interest to drug discovery, therefore some decrease in accuracy would be expected. Overall, the SAMPL5 distribution coefficient challenge provided great insight into the importance of modeling a variety of physical effects. We believe these types of measurements will be a promising source of data for future blind challenges, especially in view of the relatively straightforward nature of the experiments and the level of insight provided.

Measuring experimental cyclohexane-water distribution coefficients for the SAMPL5 challenge

Article 07 October 2016

Prediction of cyclohexane-water distribution coefficients for the SAMPL5 data set using molecular dynamics simulations with the OPLS-AA force field

Article 31 August 2016

Blinded predictions of distribution coefficients in the SAMPL5 challenge

Article Open access 27 September 2016

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

This year’s Statistical Assessment of Modeling of Proteins and Ligand (SAMPL) challenge focuses on prediction of cyclohexane–water distribution coefficients. The inclusion of distribution coefficients replaces the previous focus on hydration free energies which was a fixture of the past five challenges (SAMPL0-4) [1–7]. Due to a lack of ongoing experimental work to generate new data, hydration free energies are no longer a practical property to include in blind challenges. It has become increasingly difficult to find unpublished or obscure hydration free energies and therefore impossible to design a challenge focusing on target compounds, functional groups or chemical classes. But this type of data is extremely valuable—the past SAMPL challenges have driven real improvements in a variety of methods for calculating hydration free energies [1]—so we sought to include a similar physical property in SAMPL5. The organizers of SAMPL5 settled on cyclohexane–water distribution coefficients, and thanks to a partnership with Genentech, this led to a series of measurements on drug-like compounds, discussed in detail in this issue [8]. The experimental measurements are also straightforward enough that future distribution coefficient challenges can be deliberately designed to focus on issues that merit attention to move the field forward.

Partition and distribution coefficients are important physical properties [9, 10] which can provide a valuable opportunity for testing computational methods and molecular models. Distribution coefficients describe how all forms of a solute distributes itself across two immiscible solvents. In this case,

$$\begin{aligned} D = \frac{\sum _i [X_i]_{cyc}}{\sum _i [X_i]_{aq}} \end{aligned}$$

(1)

where $X_i$ represents a single protonation or tautomeric state of the solute in one of the solvents, and the sum runs over all protonation and tautomeric states [10]. Results are reported as the logarithm of this ratio ($\log D$). These are more complicated than partition coefficients ($\log P$), which measure the concentration of the neutral solute in both solvents [9]. Specifically, if only one tautomer is relevant, $\log P$ is proportional to the transfer free energy, and can be calculated from solvation free energies [11–21]. In contrast, all relevant charged and neutral forms of the solute will need to be included to accurately calculate $\log D$, which can be estimated from a calculated $\log P$ and the relative populations of protonation states and tautomers in each solvent. Thus, accurate tautomer enumeration in both solvents may be an important part of predicting $\log D$, introducing new complexities to the SAMPL challenge which were avoided in previous hydration free energy challenges.

Here we give an overview of the analysis done for the SAMPL5 challenge, including the compounds considered, overall performance of submissions, and the metrics used for analysis. We also include details for a set of reference calculations we performed estimating $\log D$ as the cyclohexane/water partition coefficient as well as a series of follow-up studies focusing on the importance of tautomers in estimating $\log D$. Overall, we believe the outcome of the present SAMPL5 challenge highlights the potential benefits of this type of experimental data to improve computational methods, force fields, sampling algorithms, and treatment of protonation states and tautomers. Many of these issues will be highly relevant for nominally more challenging problems, such as prediction of protein–ligand binding affinities.

Challenge logistics

SAMPL5 began on September 15, 2015 when the specifications and input files for the challenge were made available on the D3R website (http://drugdesigndata.org); these are also provided in the supporting information, made available on the University of California Dash (http://n2t.net/ark:/b7280/d1988w). The challenge deadline was February 2, 2016 and experimental results were provided to participants not long after. As in past SAMPL challenges, each group could submit multiple sets of predictions. There was also the option for participants’ identities to remain anonymous, although their results and method descriptions would still be made available. A total of 76 prediction sets from 18 participants or participating groups were submitted and assigned a random 2 digit ID number, 01–76, that will be used throughout this paper. Predictions were analyzed and overview statistics, as well as individual analysis of each submission by various error metrics (as detailed below) were returned to each participant. The challenge culminated with discussions of participants experiences and results at the 1st D3R Workshop at the University of California, San Diego March 9–11, 2016.

Table 1 A complete list of compounds used in the SAMPL5 challenge, sorted by batch

Full size table

For the prediction of distribution coefficients in SAMPL5, a total of 53 molecules were considered. They were assigned an identifier in the form SAMPL5_XXX and are pictured in Table 1. The 53 molecules were divided into batches 0, 1, and 2 containing 13, 20, and 20 molecules respectively. We wanted each batch to have a similar dynamic range and for the molecules to increase in size across batches, so on average the smallest molecules are in batch 0 and the largest in batch 2. To ensure each batch had adequate dynamic range, the molecular weight and estimated octanol/water partition coefficient were computed for each compound. These partition coefficients were estimated with OpenEye’s $\log P$ calculator. Molecules were then assigned to bins by estimated partition coefficient, and assigned to batches based on molecular weight. Specifically, the smallest molecules from each partition coefficient bin were added to batch 0, then batch 1, and the rest of the molecules comprise batch 2.

Participants could submit partial sets of predictions as long as they included full consecutive batches; that is, they could submit batch 0, batches 0 and 1, or batches 0, 1, and 2. The idea was that all participants should attempt predictions on the full set if at all possible, but grouping into batches would allow people with particularly demanding methods (such as polarizable force fields or methods requiring intensive quantum mechanics) to focus on smaller compounds and still be evaluated. Eight submissions from two participants included results for only batch 0, and an additional five submissions from two participants provided only batches 0 and 1. Here we focus on the results for the complete set of molecules (batches 0, 1, and 2). Separate analysis for data subsets is available in the supporting information on Dash

Participants were asked to report a cyclohexane/water distribution coefficient for each molecule. As discussed above, distribution coefficients are the ratio of concentrations for all forms of the solute in the cyclohexane and aqueous layers, at a specified pH. In this case, experiments were done with the water layer consisting of a buffered aqueous solution at pH 7.4. We also required participants to provide two estimates for uncertainty, a statistical uncertainty for their computational method and a model uncertainty that estimates agreement with experiment. The statistical uncertainty was intended to be the variation expected over repeated calculations of the same value. The model uncertainty, on the other hand, was intended to provide an estimate of how well the calculated value will agree with experiment. For example, in a recent study we computed cyclohexane/water partition coefficients using alchemical solvation free energy calculations in GROMACS where the statistical uncertainties were around 0.05 log units, but the root mean squared error was around 1.4 log units [23], so an appropriate estimated model uncertainty would have been 1.4 log units. A careful analysis of expected error could even yield model uncertainties which would vary based on the anticipated difficulty or complexity of a compound. Our interest in model uncertainties is in part based on the realization that an important part of creating predictive models is the ability to know when they will be unreliable or fail. Thus, analysis of model uncertainties is an important part of evaluating any model.

Analysis of submission performance

As in past SAMPL challenges, we considered a variety of error metrics in analyzing all predictions submitted to SAMPL5. Each error metric was calculated for all submissions, by batch, and distributed to challenge participants before the workshop. Here we will focus primarily on six error metrics: the root-mean-squared error (RMSE), average unsigned error (AUE), average signed error (ASE), Pearson’s R (R), Kendall’s tau (tau), and the ‘error slope’ explained in depth below. We also calculated the maximum absolute error and the percent of predictions with the correct sign, but these are not included in the analysis here. However, these metrics were provided to challenge participants and are available in the supporting information on Dash. The uncertainty in each metric was calculated as the standard deviation over 1000 bootstrap trials, where each trial consists of creating a ‘new’ dataset by sampling pairs of (predicted, calculated) values from the original set, with replacement. As described previously, this bootstrapping technique also included variation in the experimental values based on their reported uncertainties [1].

As discussed above, an important factor influencing the utility of a predictive tool is the ability of the tool to not only provide predictions but to predict the accuracy of those predictions—that is, how well the calculated values are likely to agree with experiment—not just its statistical error. To assess this, as in SAMPL4 [1], a quantile–quantile plot (QQ Plot) was created for each prediction set [24]. QQ Plots compare the fraction of a normal distribution within a specified number of standard deviations to the distribution of errors (calculated minus experiment) that are within that number of model uncertainties. For example, consider the number of predictions within one standard deviation of the expected value; if the samples are drawn from a normal distribution, then 0.68 of the values ought to fall within one standard deviation, so the value on the x-axis is 0.68. The value on the y-axis will represent the fraction of predictions that are within one model uncertainty of the experimental value. If the model uncertainty is accurate, then this also ought to correspond to a value of 0.68. A linear regression analysis helps summarize these results. The ‘error slope’ is the slope of the line comparing the fraction of predictions within a specified range of experiment to the expected fraction from a normal distribution. An error slope of greater than one indicates that the calculated values are within uncertainty of experiment more often than expected, or in other words the model uncertainty was overestimated. In contrast, an error slope less than one suggests the model uncertainty was underestimated.

We also attempted to identify any individual molecules where most of the methods failed to accurately estimate the distribution coefficient. To accomplish this, we analyzed all predictions on a molecule-by-molecule basis via our usual set of error metrics. Here we will primarily focus on just average unsigned error for molecules, but all other error metrics were provided to participants and are available in supporting information on Dash.

Reference calculations

We calculated distribution coefficients through a few different methods as a reference. K.H.B., a graduate student in the Mobley group, performed a set of blind calculations estimating the $\log D$ as a partition coefficient between cyclohexane and water calculated from solvation free energies. In addition, C.C.B. and D.L.M. performed post-challenge analysis of protonation and tautomeric states and used this to convert our calculated partition coefficients to distribution coefficients. We considered a null hypothesis where all molecules are assumed to distribute equally between cyclohexane and water. Many fast structure-based tools for octanol/water partition coefficients exist, and we used one of those to estimate partition coefficients, both with no correction for the fact that we are interested in cyclohexane, and with a small adjustment for this as discussed below.

Calculating partition coefficients from solvation free energies

We decided to estimate distribution coefficients via a $\log P$ calculation, by assuming only a single neutral tautomer of each solute, then calculating $\log P$ from a difference in solvation free energies. Before the challenge, each molecule was taken directly from the provided SMILES string. As demonstrated in the literature [11–21], partition coefficients are directly proportional to the difference between the solvation free energy for the solute into each solvent. We use previously established and automated protocols [23] to calculate the solvation free energy of each molecule into water and cyclohexane. Then the calculated partition coefficient was reported as an estimate for $\log D$.

To calculate solvation free energies, we used automated tools created by the Mobley lab. Molecular dynamics simulations were performed in GROMACS [25–31] with the General AMBER Force Field (GAFF) [32] with AM1-BCC charges [33, 34]. The TIP3P water model [35] was used for the aqueous phase. Topology and coordinate files for the solvated boxes with 1 solute molecule and 500 cyclohexane or 1000 water molecules were built using our Solvation Toolkit [23]. These files were then converted to AMBER, DESMOND, and LAMMPS formats and provided to SAMPL5 participants as reference calculations. The Solvation Toolkit takes advantage of many open source Python modules and is available at https://github.com/MobleyLab/SolvationToolkit. It converts SMILES strings or IUPAC names of any mixture of small organic compounds to parameterized molecules and builds topology and coordinate files for a variety of simulation packages. All molecular dynamics parameters are identical to previous studies [4, 23, 36]. The molecule is taken from the solvated box to a non-interacting gas phase in 20 lambda values. Solvation free energies are calculated with Alchemical Analysis tool [37] using the multi-state Bennett acceptance ratio to extract the free energy difference between the beginning and end state. The partition coefficient was calculated as the difference between the cyclohexane solvation free energy and the hydration free energy. The statistical uncertainty was reported as the propagated uncertainty from the solvation free energy calculations. The model uncertainty was estimated to be the same for all molecules and reported as the root-mean-squared error from a recent study on calculating cyclohexane/water partition coefficient, specifically 1.4 log units [23]. These reference calculations were assigned submission ID 39 and included in the error analysis performed on all submissions.

Simulation box size does not affect the calculated solvation free energy

Hydration free energies were previously shown to be independent of box size for box edges ranging from 2 to 9 nanometers, within calculated uncertainties [38]; however, here, because cyclohexane is much less polar, we had some concern that finite size effects could still be significant. To explore this, we performed tests in which we varied the simulation box size. Because more polar solutes are more likely to have substantial long range interactions, we calculated the dipole moment of each SAMPL5 molecule using the position and charges on atoms in the mol2 files. SAMPL5_024 had the largest dipole moment so it was used as the solute for the box size investigation. The solvation free energy calculations were set up as described above, changing the number of cyclohexane molecules from 100 to 500. Our calculations above are performed with lattice-sum (PME) treatment of coulomb interactions. It was the primary focus of this check and we included duplicate calculations for the smaller box sizes where the initial coordinate file was the same, but new velocities were generated for the equilibrium phases. We also repeated the solvation free energy calculations with reaction field coulomb interactions assigning the dielectric coefficient for cyclohexane, 2.0243 [39]. For both types of simulation, the calculated solvation free energy fluctuated around an average of 19.1 kcal/mol with no trend that would suggest solvation free energy depends on box size (Fig. 1). There are many other explanations for fluctuations in calculated solvation free energies, including sampling, that might account for the 0.48 kcal/mol range in the PME simulations. Ultimately, we find that box edge lengths from 2.64 to 4.54 nm have no significant effect on the calculated solvation free energies. This suggests that in the future, smaller box sizes could be used for computational efficiency. The input, results, molecular dynamics parameter and coordinate files, and tables of solvation free energies are available in the supporting information on Dash.

Consideration of tautomers after SAMPL5

To help understand how the results from our partition coefficient calculations could have been improved, we considered corrections for changes in the solutes’ protonation or tautomeric states. Distribution coefficients differ from partition coefficients in that they include all forms of the solute in both solvents. A common way to convert between experimentally measured distribution coefficients and partition coefficients is with pKa values for the solute [40]. This is a simple correction using the Henderson–Hasselbalch equation:

$$\begin{aligned} pH = pK_a + \log \frac{[X]}{[HX]} \end{aligned}$$

(2)

to relate the concentration of neutral species to the charged species at a given pH. This correction assumes the solute only has one other relevant protonation state and changes for acidic and basic molecules. Zwitterions and other neutral tautomers are not taken into account. The equation used to calculate a distribution coefficient ($\log D$) from a partition coefficient ($\log P$) for a basic solute (or X in Eq. 2) is below

$$\begin{aligned} \log D = \log P - \log(1+10^{pK_a-pH}) \end{aligned}$$

(3)

Alternatively for an acidic solute (or HX in Eq. 2) we would instead use:

$$\begin{aligned} \log D = \log P - \log(1+10^{pH-pK_a}) \end{aligned}$$

(4)

We use Schrödinger’s Epik tool [41–43] to estimate pKa values for each molecule according to experimental conditions. We then estimated $\log D$ using the equations above, accounting for just one change in protonation state, meaning each solute was taken to be either acidic or basic. For acidic solutes, the smallest acidic pKa was used with Eq. 4, oppositely for basic solutes the largest basic pKa was used with Eq. 3 to estimate $\log D$ from $\log P$.

Using pKa values only accounts for one change in protonation, whereas a correct distribution coefficient should include all relevant tautomers and protonation states of the molecule in both solvents. To account for all other tautomer states, we used Schrödinger’s LigPrep [44] to enumerate tautomers for each molecule in the aqueous solution. The results of the enumeration can include an energetic “state penalty” calculated with Epik which relates the population of that tautomer to all others. This state penalty can be converted into log units and used as a correction term to convert $\log P$ to $\log D$:

$$\begin{aligned} \log D = \log P + \frac{-E_{state\ penalty}}{k_BT \ln (10)} \end{aligned}$$

(5)

where $k_B$ is Boltzmann constant and T is temperature. LigPrep can only perform the tautomer enumeration with water or DMSO as a solvent, so we were unable to predict tautomers in cyclohexane. Therefore both of these corrections account for the protonation or tautomer states only in the aqueous layer and assume the tautomer remains fixed in cyclohexane as the one used in the initial simulation. In the results section below, the corrections performed with $pK_a$ and the corrections made with the the calculated state penalty are referred to $\log D_{pK_a}$ and $\log D_{state\ penalty}$ respectively.

Estimating distribution coefficients with a fast, structural based partition coefficient calculator

Many structure-based tools exist for octanol/water partition coefficients; they are very fast and generally accurate. However, these tools are all trained on empirical data, meaning they are limited by the training data. We chose the OpenEye tool OEXlogP [45, 46] as an example of such a tool. Two post-prediction sets were prepared with the OEXlogP tool. First, the predicted octanol/water partition coefficient was considered an estimate for cyclohexane/water distribution coefficient. In the second set, we used a linear regression to correct for the bias between the calculated XlogP values and a set of experimental cyclohexane/water partition coefficients [9]. For the rest of this paper we will refer to the octanol/water partition coefficient set as $X \log\,P_{oct}$ and the bias-corrected set as $X \log\,P_{corr}$.

Exploring the possibility of solvent mixing

Because no two solvents are perfectly immiscible, we wanted to explore the effect that a small amount of water present in cyclohexane would have on computed $\log D$ values for one of the more polar solutes. The experimental concentration of water in cyclohexane is 0.00047 mole fraction [47]. The presence of water in the cyclohexane phase has a possibility of affecting the transfer free energy, especially for solutes with many polar functional groups. Also, it has been suggested that particularly polar compounds can pull water with them from the aqueous layer into the organic solvent [9]; while this is a kinetic argument and should not apply to equilibrium thermodynamic properties like $\log D$, the point is well taken—some solutes have a particularly high affinity for water and may actually impact the amount of water present in cyclohexane when the solute is present at finite concentration. Thus, to investigate this, we took one of the most polar compounds—one for which we had particularly large errors relative to experiment—SAMPL5_074, and performed two additional sets of free energy calculations in cyclohexane. Both sets of simulations had the single solute in 150 cyclohexane molecules, but varied in water content—the first with seven water molecules present in cyclohexane, and the second with only a single water molecule. The input files for these simulations were also created with Solvation Toolkit as described above and the same simulation protocol was followed. These simulations were conducted after the SAMPL submission deadline in order to help us understand the role of water in such cases.

The experiments were not done on completely pure water and cyclohexane—particularly, experimental distribution coefficients were measured with small amounts of dimethyl sulfoxide (DMSO) and acetonitrile present in solution. Therefore, we investigated the how acetonitrile and DMSO would distribute in simulations. The experimental concentrations reported for each solvent are approximately 50, 50, 1, and 0.4 % by volume for cyclohexane, water, DMSO, and acetonitrile respectively [8]. To explore how the DMSO and acetonitrile distribute between cyclohexane and water we performed a single simulation with 130 cyclohexane, 780 water, four DMSO, and two acetonitrile molecules, mirroring those concentrations. Input topology and coordinate files were created with Solvation Toolkit. The system was minimized and equilibrated following our reference calculation procedure and then a 5 ns constant pressure and temperature production simulation was run. The trajectory from this simulation was visualized with VMD [48] and a movie is available in the supplementary information.

Results and discussion

A broad range of methods were used for the 76 submissions predicting cyclohexane/water distribution coefficients for the SAMPL5 challenge. Many of these predictions used alchemical molecular dynamics simulations to estimate the solvation free energy in explicit solvent using several classes of force fields, including fixed-charge all-atom force fields [49–53], all-atom/coarse-grained hybrid force fields [54], and polarizable force fields [55]. One participant used Semi-Explicit Assembly, a type of implicit solvent solvation free energy method applied to one or more chosen solute conformations [56]. A variety of quantum mechanics (QM) methods were also used including QM/molecular mechanics (QM/MM) with explicit solvent [52, 53], QM with non-Boltzmann Bennett free energy calculations [52, 53], and QM energy calculations with a single optimized molecular geometry [52, 57]. Two participants used variations on the reference interaction-site model (RISM), an integral equation approach, to predict solvation free energies [58, 59]. One participant used QM calculations to derive parameters to tune an empirical model for activity coefficients and used these to estimate distribution coefficients [60]. A few submissions used empirically trained methods for calculating solvation free energies [61, 62]. One particularly successful submission, which will be discussed again below, employed the Conductor-Like-Screening Model for Real Solvents (COSMO-RS) [63].

SAMPL5 is the first SAMPL challenge to include distribution coefficients, but we can estimate how well we expect submissions to do based on past SAMPL challenges which included hydration free energies. Distribution coefficients can be related to transfer free energy between solvents, which allows us to estimate an expected performance from root-mean-squared error (RMSE) in past hydration free energy calculations. In SAMPL4 [1], the average RMSE for the best half of submissions was about 1.5 kcal/mol which would correspond to a 1.54 log unit error in a distribution coefficient if both solvation free energies have comparable errors. Here, only five submissions had an RMSE less than 2.5 log units in SAMPL5. There are many reasons for this perceived change in accuracy, such as a more complex set of molecules, the use of cyclohexane as a solvent, and the complexity of estimating tautomer populations, discussed in depth below. Since this is the first challenge on predicting distribution coefficients, it is likely that participants had not yet developed good protocols to deal with many of these challenges, meaning that somewhat less accuracy ought to be expected. It took several challenges focused on hydration [1–7] before a range of methods could achieve the success noted in SAMPL4.

Table 2 Error metrics were calculated for each set of predictions, including root-mean-squared error (RMSE), average unsigned error (AUE), average signed error (ASE), Kendall’s tau (tau), and Pearson’s R (R). Error slope refers to the slope of data in a QQ Plot. Indicated submissions included only batch 0$^{\mathrm{a}}$ or batches 0 and 1$^{\mathrm{b}}$

Full size table

As discussed above, we calculated root-mean-squared error (RMSE), average unsigned error (AUE), average signed error (ASE), Pearson’s R (R), Kendall’s tau (tau), and the slope from the QQ plot (error slope) for each set of predictions. These are reported for all submissions (Table 2), but the rest of the analysis will focus only on submissions that reported results for batches 0, 1, and 2. For each submission, we also created a plot comparing the predicted and experimental values. Some example plots are provided (Fig. 2); these represent a typical submission, in that these submissions were in the middle of the pack by most error metrics. Comparison and QQ plots for every submission are available in the supporting information on Dash as well as error metric tables by batch rather than just for the full set.

To help visualize all of the error metrics, the data was compiled into a histogram where results are sorted from best to worst for that metric (closest to 1 for error slope for example). These metrics are split into measurements of deviation from experiment (Fig. 3) and correlation with experiment (Fig. 4) distinctions which helped in identifying high performing groups. This analysis included only submissions that included data for all molecules; the other submissions were indicated in Table 2 and generally fall in the middle of the pack on most metrics. In comparing methods by all of the error metrics, it is important to keep in mind the uncertainty in these error metrics. While Figs. 3 and 4 are ordered by method performance in some sense, the reality is that there are many submissions that are not significantly different from one another.

In the error slope analysis, the slopes are often substantially different from 1, indicating that participants generally provided poor estimates of model uncertainty. Only the top three submissions are within uncertainty of 1. Sebastian Diaz-Rodriguez et al. from Miami University used conservative estimates based on results in previous calculations for solubility and hydration free energy for submissions 53 and 60 [60]. Gerhard König et al. from Max-Planck-Institut für Kohlenforschung provided no explicit discussion of model uncertainty with submission 43 [53]. Only submission 40 significantly overestimated their model uncertainty. All other submissions have an error slope below one, indicating a significant underestimation of the model uncertainty. This suggests that analysis and prediction of model uncertainty remains a key frontier for predictive molecular simulations, and further effort is needed in that area.

Top performing submissions

In order to determine which submissions performed the best, we group error metrics into two categories. The first category describes typical error relative to experiment, and includes metrics RMSE and AUE. The second category describes how well correlated the experimental values are with the experimental values, and includes Kendall $\tau$ and Pearson R. Unlike past SAMPL challenges, there does appear to be one submission which performs best by all of these metrics, submission 16, and for most metrics it is better by a statistically significant amount. Next, we considered the top ten submissions for each of these four metrics. There were only two submissions, other than 16, which performed in the top ten for at least three of these metrics, submissions 14 and 36. Predictions from each of these submissions are compared to experiment in Fig. 5. For submission 16, Andreas Klamt et al. from COSMOlogic used COSMO-RS to compute a partition coefficient for each solute from the difference in chemical potentials for the solute in each solvent [63]. To find distribution coefficients, calculations for the formation of different protonation states, zwitterions, and tautomers were performed in COSMO-RS for relevant molecules. For submission 14, Frank Pickard et al. from the National Institute of Health calculated solvation free energies from QM calculations with SMD implicit solvent in Gaussian. Absolute $pK_a$ calculations were used to account for additional ionization states [52]. For submission 36, Sherin Shanaka Paranhewage et al. from Oklahoma State University estimated $\log D$ as a partition coefficient, calculated from the difference in alchemical solvation free energies where the solute was parameterized with the dielectrically corrected general AMBER force field, water was the dielectrically corrected H₂O-DC model, and cyclohexane was a specially optimized united-atom model [49]. Further details for each of these submissions can be found in this issue so only a brief explanation of each method was provided here.

Comparisons to simple empirical models

Table 3 Null hypothesis corresponds to $\log D = 0$ for all molecules

Full size table

One way of evaluating predictive models is to compare them to a null hypothesis, or default result of some kind. In the case of distribution coefficients, we chose a null hypothesis where we assume all solute molecules distribute equally between cyclohexane and water, corresponding to $\log\,D = 0$, as suggested by Christopher Fennell [64]. We performed all our standard error analyses discussed above (RMSE, AUE, and Ave. Err) on this simple model as a point of comparison (Table 3). The null hypothesis would have been the top submission for both RMSE and AUE. While this null

hypothesis has no actual predictive power and could not be used to rank compounds, the fact that it performs better than any submission in terms of error statistics is a challenge for the other methods. These results may also provide commentary on the dataset, which contains a reasonably large percentage of $\log D$ values that are not that far off from zero (Fig. 6). Organizers had hoped to ensure equal coverage of all $\log D$ values within the assay range, but due to experimental time constraints this was not possible. It is possible the null model would look worse if the experimental results were more evenly dispersed across the entire dynamic range.

There are many structure-based and/or empirically trained prediction methods for octanol/water partition coefficients. To a first approximation, one might imagine that cyclohexane/water partition coefficients would follow similar trends to those in octanol/water. Therefore, we used OEXlogP from OpenEye ($X\log\,P_{oct}$) to examine the possibility of estimating cyclohexane/water distribution coefficients with such a tool. Next, we compared $X\log\,P_{oct}$ results for a set of compounds with experimental cyclohexane/water partition coefficients [9]. A linear regression was used to correct the $X\log\,P_{oct}$ values with a slope of 0.7241 and a y-intercept of −1.0306 ($X\log\,P_{corr}$). $X\,\log\,P_{oct}$ would be in the top few submissions by tau and R, but ranked in the middle for all other metrics (Table 3). However, with a simple linear regression trained on experimental cyclohexane/water partition coefficients, $X\,\log\,P_{corr}$ has a better RMSE and AUE than any SAMPL5 submission. We do not wish to suggest that regression-trained tools are the best mechanism for predicting distribution coefficients; rather, this indicates the potential for cyclohexane/water distribution data to help drive improvements in our physical models, as clearly there are a range of physical effects here which are not yet well described by our models.

Results of reference calculations

We performed a set of blind reference calculations (submission 39) for the SAMPL5 challenge, calculating $\log P$ for the provided neutral tautomers of all solutes. Our protocol for these calculations was announced in advance, and parameter and coordinate files for the calculations were made available (as described above) in formats for a variety of simulation packages. Participants were encouraged to perform their own set of reference calculations for the full set, or at the very least several specified reference compounds, using these files. This would allow differences in performance to be traced back to methodological differences rather than force field differences. Unfortunately, no participants actually reported results of reference calculations, so this type of analysis has thus far been impossible.

However, the results of our reference calculations are still helpful for understanding the challenges facing SAMPL5 participants.

For our reference calculations, solvation free energies were calculated using GROMACS with GAFF parameters and AM1-BCC charges. Our reference calculations yielded partition coefficients, determined from the difference in solvation free energies without correcting for variation in tautomers. These calculations were done blindly, and analyzed as submission number 39, which was in the top quarter of submissions by most error metrics (Table 2) although, there is a slight bias favoring solvation in cyclohexane, evidenced by the average error ($1.6 \pm 0.3$).

Table 4 Shown are the results for error analysis on our reference calculations which estimated $\log D$ as a cyclohexane partition coefficient ($\log P$) and the correction to distributions coefficients by $pK_a$ ($\log D_{pK_a}$) and by state penalty ($\log D_{state\ penalty}$)

Full size table

After the challenge we explored how including protonation and deprotonation would have affected the initial partition coefficient predictions. The first set of corrections involved calculating the pKa for each molecule using Schrödinger’s Epik tool [41–43]. Next, $\log D$ was calculated using the pKa and partition coefficient determined in submission 39 using Eqs. 3 and 4 for basic and acidic solutes, respectively. We assumed only one change in protonation state occurred so only one $pK_a$ was used. This does not account for zwitterions or alternate neutral tautomers. This correction (labeled $\log D_{pK_a}$) showed a slight improvement by most error metrics (Table 4) including a decrease in the average error from $1.6 \pm 0.3$ to $0.7 \pm 0.3$ indicating less bias toward overly high concentration in cyclohexane.

For the next set of corrections, we used Schrödinger’s

Ligprep tool [44] to enumerate tautomers and calculate a state penalty or relative energy of each tautomer in an aqueous buffer at pH 7.4. The state penalty was used to correct the concentration in the aqueous layer, according to Eq. 5. This correction (labeled $\log D_{state\ penalty}$) results show improvements from the original partition coefficient coefficients for tau ($0.49 \pm 0.08$ to $0.65 \pm 0.06$) and R ($0.6 \pm 0.1$ to $0.77 \pm 0.06$), but no significant change in RMSE or AUE. It should be noted that no attempt was made to estimate uncertainties in the newly calculated $\log D_{pK_a}$ or $\log D_{state\ penalty}$ although both of these corrections would certainty have introduced uncertainty into the estimate. Future exploration in tautomer estimation will need to account for the uncertainty in those calculations.

Both of these correction methods only adjust the concentration in the aqueous layer; however, there may be tautomer affects that would change the concentration in cyclohexane as well. Outliers and molecules with particularly significant changes in $\log D$ were indicated by number in Fig. 7. SAMPL_050 for example, had an initial $\log P$ value of $1.20 \pm 0.04$ which was decreased significantly to $-9.78$ with the pKa correction and $-10.70$ with the state penalty correction compared to the experimental value $-3.2 \pm 0.6$. SAMPL_060 and SAMPL_063 also changed by more than 3 log units due to these corrections. These state penalties allow us to account for other tautomers and protonation states only in the aqueous phase. Without tautomer enumeration in cyclohexane, we have to assume that the tautomer used for solvation free energy calculations, prior to correction, is the dominant state of the solute in the cyclohexane phase. If an alternate tautomer were relevant in water and cyclohexane, we could obtain dramatically incorrect values with this approach, since our state penalty will only recover alternate tautomer(s) in water, but not cyclohexane. This appears to be one of the reasons why these corrections seem to overshoot in the cases listed above. A better solution would compute state penalties in both water and cyclohexane, but we do not currently have an adequate approach for doing so.

Even after correcting for tautomer enumeration in the aqueous phase, there is a slight bias for cyclohexane in the calculated distribution coefficients (Table 4; Fig. 7). There are a variety of factors that could be contributing to this slight bias. A recent study by the Mobley group calculated partition coefficients for small molecules and found a slight bias for alcohol compounds to over favor cyclohexane [23]. Possibly, this demonstrates a limitation in GAFF or atomistic force fields in general to accurately predict the behavior of solutes in polar and non-polar environments. That same study found that for large, flexible compounds insufficient conformational sampling can dramatically affect the calculated partition coefficients [23]. Given the number of large, flexible, and polyfunctional compounds in SAMPL5, more investigation into each of these effects along with improved tautomer enumeration and handling (especially for the cyclohexane phase) will be required to completely understand the slight bias seen here.

Examining individual molecules

With only 53 molecules, it is difficult to find any statistically significant trends in terms of functional groups which are well- or poorly-predicted in general; compared to past SAMPL challenges, this set of molecules is much more complex. They are on average larger, more flexible, and contain multiple functional groups per compound. For each molecule, we organized a data set of predicted distribution coefficients and compared them to the experimental values, calculating the average unsigned error for each (Table 1). There are only three molecules with an AUE less than 2.0 log units (SAMPL5_ 003, 045, and 059). While these three molecules are relatively small, there were no trends in AUE and molecular weight, as there was in SAMPL4 hydration free energy results [1]. We tried grouping molecules by functional group, molecular mass, and estimated number of tautomers to see if size, presence or absence of particular functional groups, or number of tautomers played a role in how difficult each compound was in general. The only trend found in this process was that all five carboxylic acids (SAMPL5_ 010, 011, 015, 026, and 060) are in the worst ten molecules by AUE and RMSE. This could be due to poor treatment of the effects of protonation state changes. Among the bottom compounds, perhaps unsurprisingly, was SAMPL5_083 which is a large macrocycle, and SAMPL5_050; both have many neutral tautomeric forms. Most submissions had significant errors in predicting SAMPL5_074, despite the fact that it is relatively small, rigid, and has no other significant tautomers. Below we will explore why some of these molecules may have had distribution coefficients which were particularly difficult to predict.

The provided SMILES strings may not be the most populated tautomeric form of the molecule

From our tautomer enumeration and discussions with other SAMPL5 participants [65] it became clear that accurately estimating $\log D$ for molecules with many tautomers was difficult. For example, we compared population corrections we derived using Schrödinger’s LigPrep [44] with corrections calculated by Pickard et al. [52, 66] and found significant differences between them. If we could perfectly calculate solvation free energies and tautomer populations in both solvents, the starting tautomer should not affect the final calculated distribution coefficient, but differing corrections—such as these—will yield different results. Additionally, whenever protonation state/tautomer populations are not estimated correctly or not included in both solvents, the initial choice of protonation state/tautomer is likely to affect computed $\log D$ values. Here, our initial solvation free energy calculations used provided SMILES strings without any consideration of other tautomers. To explore how this may have affected our $\log D$ calculation, we decided to repeat a few solvation free energy calculations with alternate tautomers. We used SAMPL5_050 and SAMPL5_083 as examples since both have other neutral tautomers that could be present in both the water and cyclohexane solutions. Also, most participants failed to accurately predict the correct $\log D$ for either solute. The new tautomer for SAMPL5_050 was found with LigPrep [44]; the one for SAMPL5_083 was found with COSMO-RS and was provided by Andreas Klamt [63, 65]. For both SAMPL5_050 and SAMPL5_083 there were significant changes in calculated solvation free energies and partition coefficients for the two different tautomers (Table 5). Distribution coefficients were calculated from the $\log P$ and state penalties calculated with Schrödinger’s LigPrep tool [44]. In both cases the $\log D$ is still significantly different from the experimental values. Since both the calculated solvation free energies and the tautomer/protomer populations are needed to estimate the distribution coefficient, it is impossible for us to know which calculation introduces more error into our estimates.

Table 5 Shown are the results for solvation free energy of two different tautomers of SAMPL5_ 050 and 083

Full size table

Table 6 Shown are results for the solvation free energy and distribution coefficient for SAMPL5_074 in cyclohexane with no water present, 1 water molecule, and 7 water molecules all with 150 cyclohexane molecules and 1 solute molecule

Full size table

Solvents are not completely immiscible

Though the concentration of water in cyclohexane is very small, 0.00047 mole fraction [47], it may still affect how a solute is distributed across the two solvents. This will be particularly important for solutes with many polar groups; and may be one reason it was difficult to accurate estimate the $\log D$ for SAMPL5_074. We performed two new calculations of solvation free energy of SAMPL5_074 into cyclohexane with some water present in cyclohexane. These simulations were set-up with Solvation Toolkit as described above. Both sets of simulations had a single solute molecule in 150 cyclohexanes, but one had seven water molecules and the other had a single water molecule. While neither of these results reach the low experimental concentration of water in cyclohexane, they do demonstrate the dramatic variation in solvation free energy and $\log D$ which can be caused by the presence of water in cyclohexane. With varying amounts of water in cyclohexane, the water dramatically impacts the computed solvation free energy for SAMPL5_074 into cyclohexane (Table 6) and thus calculated $\log D$ values. In this case particular case, the presence of water also dramatically improves the estimation for $\log D$, though as noted this is with far too high a water concentration in cyclohexane. We visualized trajectories from the production phase of our calculations and find that all water molecules stay adjacent to SAMPL5_074 for the full simulation, likely indicating a particularly high affinity for water. Addition of one or a few molecules of a second solvent (water) to cyclohexane might be expected to raise the uncertainty in calculated free energies significantly due to slow mixing of the second solvent in the simulation box, but we do not observe that here. It seems likely this is because the water molecules are so strongly attracted to the solute that they essentially stay bound throughout the simulations, so we do not observe slow mixing and the associated increased uncertainty. For a solute with so many polar functional groups, it is perhaps unsurprising that the water molecules in cyclohexane are drawn to the solute. In general, this suggests that the local concentration of water near highly polar solutes may be much higher than the bulk concentration in cyclohexane, and this may potentially be important when considering simulation settings. It is important to note that the concentration of water in cyclohexane for these simulations are 0.0067 and 0.047 for 1 and 7 water molecules, both a gross overestimate of the amount of water in cyclohexane (by multiple orders of magnitude) as our focus here was to explore the sensitivity of $\log D$ to cyclohexane water content. However, our results are sufficient to show that the presence of water in the cyclohexane phase can dramatically affect the computed distribution coefficient, depending on the affinity of the solute for water. To accurately account for these effects, we will need long simulations at much larger box sizes to not only match the experimental water concentration, but to determine how much water will localize around the solute at equilibrium. Given the extent to which water stays localized near the solute in these simulations, substantially longer simulations may also be needed to converge the relative populations of water in bulk cyclohexane versus near the solute.

Other buffer/solution components may affect distribution coefficients

The two phases for the distribution coefficients were cyclohexane and an aqueous buffer, however, DMSO and acetonitrile were used in the experiments [8] as well. While DMSO and acetonitrile were at very low concentrations, their presence in either solvent layer may affect how a solute distributes between phases. We created topology and coordinate files for a system with 780 water, 130 cyclohexane, 4 DMSO, and 2 acetonitrile molecules using SolvationToolkit, roughly matching the experimental concentrations. In the trajectory visualized with VMD, both DMSO and acetonitrile spend most of their time near the solvent interface, with very little movement into the bulk of the water or cyclohexane. In our view, this does not immediately suggest profound implications for calculated $\log D$ values, but a detailed understanding of the implications of this for both calculated and measured distribution coefficients will require further study.

Conclusions

Past SAMPL challenges often involved a broad range of methods for hydration free energies. Here, in our first SAMPL on cyclohexane/water distribution coefficients, we saw a similarly diverse set of methods. Predicting $\log D$ accurately for this set of molecules was rather difficult, exhibiting a number of the same challenges that will face accurate prediction of binding affinities. The best methods showed reasonable agreement with experiment with RMSEs around 2.5 log units and Kendall tau’s around 0.6. However, considering that the null model and $Xlog \, P_{corr}$ both would have topped the submissions list by RMSE and AUE, there is clear room for all of the methods employed in SAMPL5 to improve.

This SAMPL5 set was substantially more complex, flexible, and polyfunctional than typical molecules in SAMPL hydration challenge sets. Additionally, the most relevant protonation state and tautomer were not always clear for the compounds. Some compounds likely had multiple relevant protonation states and tautomers, and shifts in protonation/tautomeric state on transferring between phases. These issues are especially important given that the challenge focused on distribution coefficients rather than partition coefficients ($\log P$). Given these complexities, it is not surprising that we saw drop in performance relative to the accuracy that would have been expected for $\log D$ values if participants predicted $\log D$ based on solvation free energies in two solvents that could be computed as accurately as hydration free energies in previous SAMPL challenge (yielding an expected accuracy of about 1.5 log units). In large part, this is probably because of the additional complexities of distribution coefficients such as the need to account for other protonation states and tautomers. Our solvation free energy calculations performed with a less dominant tautomer of SAMPL5_050 lead to a $\log D$ estimate of $-10.70 \pm 0.04$ which is 7.5 log units below the experimental value. Thus, accurately accounting for tautomers appears to be a vital part of accurately calculating $\log D$, and better methods for treating protonation and especially tautomeric states in non-aqueous environments are needed.

As mentioned earlier, the molecules chosen for SAMPL5 are generally larger and more flexible than those in past SAMPL challenges. It follows that conformational sampling of the solutes might play a significant role in accurately predicting these distribution coefficients. In a previous study using the same protocol as the reference calculation here, the Mobley lab showed that changing the initial conformation of a large, flexible molecule dramatically changed its calculated partition coefficient due to insufficient conformational sampling [23]. In this issue, Tyler Luchko et al. spend time addressing the importance of conformational sampling for these molecules in the context of their submission results for the challenge [59].

We asked participants to estimate two forms of uncertainty, statistical uncertainty and model uncertainty, the latter of which should predict how well their calculation will agree with experiment. This latter uncertainty estimate is particularly key, as it would allow practitioners to predict how reliable their calculations are likely to be in applications. Here, we find that almost every participant dramatically underestimated their model uncertainty. The importance for the community to improve error estimation has been addressed in past SAMPL challenges [1], but clearly, much more work is still needed.

A common trend in SAMPL5 submissions was that the dynamic range in the predicted distribution coefficients was larger than the dynamic range observed experimentally—that is, generally, the smallest $\log D$ values were underestimated and the largest were overestimated. This issue of dynamic range is visible in Figs. 2, 5, and 7, but was evident in comparison plots for almost all submissions. For SAMPL5_074, the introduction of water to cyclohexane increased the $\log D$ estimate from −3.76 to −1.73; it is possible that accounting for water in cyclohexane would affect the apparent underestimation at the lower end of the $\log D$ scale. As discussed above, it is also possible that insufficient conformational sampling could account for some of this problem. For example, if a solute has a particular conformation that is more stable in a non-polar environment and that conformation was not found for the solute in cyclohexane due to insufficient conformational sampling, then the calculated distribution coefficient would over favor the aqueous environment. However, this will require further investigation. It is of course also possible that experimental issues could have compressed the measured dynamic range of compounds, but this too will require further investigation.

The results of this challenge strongly suggest predictions for solute partitioning will be extremely helpful for driving improvements to physical modeling needed in pharmaceutical research. The major challenges encountered here are all very likely to occur when attempting to predict binding affinities or other biomolecular properties of interest to drug discovery. Specifically, accurately predicting the population of protonation and tautomeric states was a challenge, complicated by the fact that there is no simple way to predicted protonation states and tautomers to corresponding experimental values relevant to the conditions studied here. Most work so far on tautomer prediction has focused on tautomeric ratios in vacuum or in water, but tautomer populations are likely environment-dependent in ways that can dramatically affect computed physical properties. These same challenges are likely to apply when predicting biomolecular binding. An improved treatment of these effects within the context of SAMPL or similar challenges will drive advances in computational techniques also used to predict binding and related properties such as solubility.

Overall, distribution coefficients have been an extremely valuable part of this year’s SAMPL5 challenge, and, since they can be measured in a relatively straightforward way, seem to be a promising potential source of future data for blind challenges. Additionally, this data highlights important issues, such as tautomer enumeration, that need better treatment in many of our models. The ability to create new, completely blind data sets make distribution coefficients a great option for future challenges.

Supporting information

Supporting information is available free of charge online through the University of California Dash at http://n2t.net/ark:/b7280/d1988w. It includes all of the files provided to SAMPL5 participants. That includes a table of SAMPL5_ID numbers paired with SMILES, MOL2 files for each molecule, and the GROMACS, AMBER, DESMOND, and LAMMPS topology and coordinate files for each solute in water and cyclohexane. A separate directory is included with analysis associated with our reference calculations and post sample method development. This includes all python scripts, input files, output files, and results files required to repeat our simulations and calculations done with Schrödinger tools. We also provide the data files for all submission, the scripts used for error analysis, plots for all submissions, and data for only batch 0 and batches 0 and 1. This is supported with detailed README files explaining the structure of the directories.

References

Mobley DL, Wymer KL, Lim NM, Guthrie JP (2014) J Comput Aided Mol Des 28(3):135
Article CAS Google Scholar
Geballe MT, Guthrie JP (2012) J Comput Aided Mol Des 26(5):489
Article CAS Google Scholar
Geballe MT, Skillman AG, Nicholls A, Guthrie JP, Taylor PJ (2010) J Comput Aided Mol Des 24(4):259
Article CAS Google Scholar
Klimovich PV, Mobley DL (2010) J Comput Aided Mol Des 24(4):307
Article CAS Google Scholar
Mobley DL, Bayly CI, Cooper MD, Dill KA (2009) J Phys Chem B 113(14):4533
Article CAS Google Scholar
Mobley DL, Liu S, Cerutti DS, Swope WC, Rice JE (2012) J Comput Aided Mol Des 26(5):551
Article CAS Google Scholar
Nicholls A, Mobley DL, Guthrie JP, Chodera JD, Bayly CI, Cooper MD, Pande VS (2008) J Med Chem 51(4):769
Article CAS Google Scholar
Rustenburg AS, Dancer J, Lin B, Feng JA, Ortwine DF, Mobley DL, Chodera JD (2016) J Comput Aided Mol Des
Leo A, Hansch C, Elkins D (1971) Chem Rev 71(6):525
Article CAS Google Scholar
Young RJ, Green DVS, Luscombe CN, Hill AP (2011) Drug Discov Today 16(17–18):822
Article CAS Google Scholar
Essex JW, Reynolds CA, Richards WG (1992) J Am Chem Soc 114(10):3634
Article CAS Google Scholar
Best SA, Merz KM Jr, Reynolds CH (1999) J Phys Chem B 103(4):714
Article CAS Google Scholar
Eksterowicz JE, Miller JL, Kollman PA (1997) J Phys Chem B 101(50):10971
Article CAS Google Scholar
Jorgensen WL (1989) Acc Chem Res 22:187
Article Google Scholar
Jorgensen WL, Briggs JM, Contreras L (1990) J Phys 94(4):1683
CAS Google Scholar
Garrido NM, Queimada AJ, Jorge M, Macedo EA, Economou IG (2009) J Chem Theory Comput 5(9):2436
Article CAS Google Scholar
Garrido NM, Jorge M, Queimada AJ, Gomes JRB, Economou IG, Macedo EA (2011) Phys Chem Chem Phys 13(38):17384
Article CAS Google Scholar
Garrido NM, Economou IG, Queimada AJ, Jorge M, Macedo EA (2012) AIChE J 58(6):1929
Article CAS Google Scholar
Yang L, Ahmed A, Sandler SI (2013) J Comput Chem 34(4):284
Article Google Scholar
Michel J, Orsi M, Essex JW (2007) J Phys Chem B 112(3):657
Article Google Scholar
Genheden S (2016) J Chem Theory Comput 12(1):297
Article CAS Google Scholar
I. OpenEye Scientific Software. Oechem (2010). www.eyesopen.com
Bannan CC, Calabró G, Kyu DY, Mobley DL (2016) J Chem Theory Comput 12(8):4015
Article Google Scholar
Wilk MB, Gnanadesikan R (1968) Biometrika 55(1):1
CAS Google Scholar
Berendsen HJC, Van Der Spoel D, van Drunen R (1995) Comput Phys Commun 91(1–3):43
Article CAS Google Scholar
Hess B, Kutzner C, van der Spoel D, Lindahl E (2008) J Chem Theory Comput 4(3):435
Article CAS Google Scholar
Lindahl E, Hess B, van der Spoel D (2001) J Mol Model 7(8):306
Article CAS Google Scholar
van der Spoel D, Lindahl E, Hess B, Groenhof G, Mark AE, Berendsen HJC (2005) J Comput Chem 26(16):1701
Article Google Scholar
Pronk S, Páll S, Schulz R, Larsson P, Bjelkmar P, Apostolov R, Shirts MR, Smith JC, Kasson PM, van der Spoel D, Hess B, Lindahl E (2013) Bioinformatics (Oxford, England) 29(7):845
Article CAS Google Scholar
Páll S, Abraham MJ, Kutzner C, Hess B, Lindahl E (2014) Solving software challenges for exascale, vol 8759. Springer, Stockholm
Google Scholar
Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E (2015) SoftwareX 1–2:19
Article Google Scholar
Wang J, Wolf RM, Caldwell JW, Kollman PA, Case DA (2004) J Comput Chem 25(9):1157
Article CAS Google Scholar
Jakalian A, Bush BL, Jack DB, Bayly CI (2000) J Comput Chem 21(2):132
Article CAS Google Scholar
Jakalian A, Jack DB, Bayly CI (2002) J Comput Chem 23(16):1623
Article CAS Google Scholar
Jorgensen WL, Chandrasekhar J, Madura JD, Impey RW, Klein ML (1983) J Chem Phys 79(2):926
Article CAS Google Scholar
Liu S, Cao S, Hoang K, Young KL, Paluch AS, Mobley DL (2016) J Chem Theory Comput 12(4):1930
Article CAS Google Scholar
Klimovich PV, Shirts MR, Mobley DL (2015) J Comput Aided Mol Des 29(5):397
Article CAS Google Scholar
Parameswaran S, Mobley DL (2014) J Comput Aided Mol Des 28(8):825
Article CAS Google Scholar
Lide DR (ed) (1996) CRC handbook of chemistry and physics, 76th edn. CRC Press, Boca Raton
Google Scholar
Sangster J (1989) J Phys Chem Ref Data 18:1111
Article CAS Google Scholar
Schrödinger Release 2014-4: Epik, version 3.0, Schrödinger, LLC, New York, NY, (2014)
Shelley JC, Cholleti A, Frye LL, Greenwood JR, Timlin MR, Uchimaya M (2007) J Comput Aided Mol Des 21(12):681
Article CAS Google Scholar
Greenwood JR, Calkins D, Sullivan AP, Shelley JC (2010) J Comput Aided Mol Des 24(6–7):591
Article CAS Google Scholar
Schrödinger Release 2014-4: Ligprep, version 3.2, Schrödinger, LLC, New York, NY, (2014)
Wang R, Fu Y, Lai L (1997) J Chem Inf Model 37(3):615
CAS Google Scholar
Wang R, Gao Y, Lai L (2000) Perspect Drug Discov Des 19(1):47
Article CAS Google Scholar
Black C, Joris GG, Taylor HS (1948) J Chem Phys 16(5):538
Article Google Scholar
Humphrey W, Dalke A, Schulten K (1996) J Mol Graph 14(1):33
Article CAS Google Scholar
Paranahewage SS, Gierhart CS, Fennell CJ (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9950-z
Google Scholar
Iorga B, Kenney IM, Beckstein O (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9949-5
Google Scholar
Bosisio S, Mey ASJS, Michel J (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9933-0
Google Scholar
Pickard F, König G, Tofoleanu F, Lee J, Simmonett A, Shao Y, Ponder J, Brooks BR (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9955-7
Google Scholar
König G, Pickard FC, Huang J, Simmonett AC, Tofoleanu F, Lee J, Dral PO, Samarjeet FNU, Jones M, Shao Y, Thiel W, Brooks BR (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9936-x
Google Scholar
Genheden S, Essex J (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9926-z
Google Scholar
Kamath G, Kurnikov I, Fain B, Leontyev I, Illarionov A, Butin O, Olevanov M, Pereyaslavets L (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9958-4
Google Scholar
Brini E, Paranahewage SS, Fennell CJ, Dill KA (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9961-9
Google Scholar
Jones MR, Brooks BR, Wilson AK (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9964-6
Google Scholar
Tielker N, Tomazic D, Heil J, Kloss T, Ehrhart S, Güssregen S, Schmidt KF, Kast S (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9939-7
Google Scholar
Luchko T, Blinov N, Limon GC, Joyce KP, Kovalenko A (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9947-7
Google Scholar
Diaz-Rodriguez S, Bozada SM, Phifer JR, Paluch AS (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9945-9
Google Scholar
Park H, Chung KC (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9928-x
Google Scholar
Santos-Martins D, Fernandes PA, Ramos MJa (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9951-y
Google Scholar
Klamt A, Eckert F, Reinisch J, Wichmann K (2016) J Comput Aided Mol Des. doi:10.1007/s10822-016-9927-y
Google Scholar
Fennell CJ (2016) Personal Communication
Klamt A (2016) Personal Communication
Pickard IV FC (2016) Personal Communication

Download references

Acknowledgments

D.L.M. and C.C.B. appreciate financial support from the National Institutes of Health (1R01GM108889-01) and the National Science Foundation (CHE 1352608), and computing support from the UCI GreenPlanet cluster, supported in part by NSF Grant CHE-0840513. This work was made possible in part by NIH grant U01 GM111528 for the Drug Design Data Resource, which supported the SAMPL workshop. M.K.G. thanks the National Institutes of Health for Grant GM061300. The contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. M.K.G. has an equity interest in and is a cofounder and scientific advisor of VeraChem LLC. We would also like to acknowledge John Shelley, Art Bochevarov, Robert Abel, and Mats Svensson from Schrödinger for their help with pKa and tautomer enumeration calculations. We also thank all the SAMPL5 participants and D3R Workshop attendees, and we especially appreciate valuable discussions with John Chodera (MSKCC), Ariën Rustenburg (MSKCC), Andreas Klamt (COSMOLogic), Christopher Fennell (Oklahoma State University), Samuel Genheden (Gothenburg University), and Frank Pickard (National Institute of Health).

Author information

Authors and Affiliations

Department of Chemistry, University of California, 147 Bison Modular, Irvine, CA, 92697, USA
Caitlin C. Bannan & David L. Mobley
Department of Pharmaceutical Sciences, University of California, 147 Bison Modular, Irvine, CA, 92697, USA
Kalistyn H. Burley & David L. Mobley
Qualcomm Institute, University of California, San Diego, CA, USA
Michael Chiu
Chemical and Biological Engineering, University of Colorado, Boulder, CO, USA
Michael R. Shirts
Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, CA, USA
Michael K. Gilson

Authors

Caitlin C. Bannan
View author publications
You can also search for this author in PubMed Google Scholar
Kalistyn H. Burley
View author publications
You can also search for this author in PubMed Google Scholar
Michael Chiu
View author publications
You can also search for this author in PubMed Google Scholar
Michael R. Shirts
View author publications
You can also search for this author in PubMed Google Scholar
Michael K. Gilson
View author publications
You can also search for this author in PubMed Google Scholar
David L. Mobley
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to David L. Mobley.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bannan, C.C., Burley, K.H., Chiu, M. et al. Blind prediction of cyclohexane–water distribution coefficients from the SAMPL5 challenge. J Comput Aided Mol Des 30, 927–944 (2016). https://doi.org/10.1007/s10822-016-9954-8

Download citation

Received: 21 June 2016
Accepted: 22 August 2016
Published: 27 September 2016
Issue Date: November 2016
DOI: https://doi.org/10.1007/s10822-016-9954-8

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Blind prediction of cyclohexane–water distribution coefficients from the SAMPL5 challenge

Abstract

Similar content being viewed by others

Measuring experimental cyclohexane-water distribution coefficients for the SAMPL5 challenge

Prediction of cyclohexane-water distribution coefficients for the SAMPL5 data set using molecular dynamics simulations with the OPLS-AA force field

Blinded predictions of distribution coefficients in the SAMPL5 challenge

Introduction

Challenge logistics

Analysis of submission performance

Reference calculations