Introduction

The second round of the SAMPL6 challenge was aimed at predicting the 1-octanol-water partition coefficient, LogP, for the neutral forms of the eleven compounds shown in Fig. 1. The experimentally measured LogP coefficients are reported in Ref. [1]. These molecules are based on the quinazoline, imidazole or pyridine heteroaromatic moieties, and are characterized by the presence of chemical groups that are ubiquitous in drug-like compounds, such as amide, amine, halogen, oxo, hydroxy and carboxy moieties. LogP values are important physical quantities in drug discovery, since they provide, in principle, valuable indications on the distribution of a molecule between a hydrophobic (e.g. lipid bilayer) and a cytosolic, aqueous environment [2].

In the context of the “physical” approaches in the SAMPL6 challenge, the LogP is computed by independently evaluating the solvation free energies in water and in 1-octanol of the neutral species in standard conditions:

$$\begin{aligned} \mathrm{LogP} = \mathrm{Log} \frac{[\mathrm{solute}]_\mathrm{oct}}{[\mathrm{solute}]_\mathrm{wat}} = -\frac{\varDelta G_\mathrm{oct}-\varDelta G_\mathrm{wat} }{RT \ln 10} \end{aligned}$$
(1)
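As a minimal numerical illustration of Eq. 1, the following Python sketch converts a pair of solvation free energies into a LogP value. The function name, the choice of kcal/mol as energy unit and the two free energy values are assumptions made only for the example; they are not data from the challenge.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def logp_from_solvation(dg_oct, dg_wat, temperature=298.0):
    """Eq. 1: LogP = -(dG_oct - dG_wat) / (RT ln 10), energies in kcal/mol."""
    return -(dg_oct - dg_wat) / (R_KCAL * temperature * math.log(10.0))

# Hypothetical solute, more favourably solvated in 1-octanol than in water
print(round(logp_from_solvation(dg_oct=-12.0, dg_wat=-9.0), 2))
```

At 298 K, a difference of about 1.36 kcal/mol between the two solvation free energies corresponds to one LogP unit, which sets the scale for the force field and sampling errors discussed in the following.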

The best-ranking of the 47 submissions using physical methods were all based on high-level quantum chemical (QM) calculations with implicit solvent parametrizations. QM approaches were also among the top performing submissions in the preceding SAMPL5 challenge on water/cyclohexane distribution coefficients [3]. QM “winners” in the recent LogP and past LogD challenges actually won a Pyrrhic victory. These good performances were largely expected, as the approach based on high-level QM calculations using implicit solvation models, with adjustable parameters specifically trained on experimental data sets, fits well with the fact that the solute is surrounded by a homogeneous environment. By the same token, inexpensive atom-based or fragment-based empirical methods like xLogP3 [4], ClogP/AlogP [5] or miLogP [6], using thousands of compounds for the training sets, all yield very good root mean square errors (RMSE) and Pearson correlation coefficients (R) for the series of compounds of Fig. 1.

Fig. 1 SM-type compounds in the SAMPL6 LogP challenge

However, the ambitious scope of SAMPL challenges for physical properties is ultimately that of testing the predictive power of computational methodologies in prospective drug design projects, that is, to evaluate “solvation energies” when the solute/ligand is embedded in a highly heterogeneous environment, modulated in a complicated way by the fluctuating protein scaffold, where micro-solvation phenomena, related to the atomistic nature of the “solvent”, can play a crucial role. As the host-guest SAMPL6 blind challenges have shown [7, 8], even these kinds of simple ligand-receptor systems seem to be out of the reach of QM-based approaches with implicit solvation schemes. In the two latest host-guest SAMPL challenges [7, 8], QM-based submissions for binding free energies of small guest molecules hosted in octa-acids or cucurbit[n]uril-type molecular containers were in fact consistently outperformed by those obtained using classical molecular dynamics (MD) techniques with explicit solvent.

MD-based methods with explicit solvent appear overall to yield consistent performances in SAMPL challenges, whether they are applied to distribution or partition coefficients or to the more challenging tests of host-guest binding free energies. The accuracy of MD schemes is related to the bias or systematic error due to the adopted atomistic interaction potential or force field. The precision, that is the reproducibility of the datum, is affected by the inherent variance of the implemented MD methodologies. While the latter can in principle be minimized by investing more and more computational resources (i.e. improving the statistical convergence of the simulations), the sources of the force field error are disparate and complex and can involve any combination of the bonded and non-bonded solute-solute, solute-solvent and solvent-solvent parametrizations. In this regard, SAMPL6 challenges are extremely useful since they provide an ideal collaborative platform on which force field performances can be rigorously assessed and new avenues for their improvement can be discovered.

In this paper, we present and discuss the results of an MD-based approach using nonequilibrium switching (NES) for the solvation energies and the LogP coefficients of the series of compounds of Fig. 1. The methodology computes the LogP according to Eq. 1, relying on the enhanced sampling of the end-states (fully coupled and fully decoupled solute in water and in 1-octanol) with Hamiltonian Replica Exchange (HREX) and on the subsequent production of hundreds of concurrent and independent nonequilibrium (NE) trajectories, where the solute is alchemically driven from the starting canonical ensemble of one end-state to the corresponding nonequilibrium ensemble of the arrival end-state in a matter of a few hundred picoseconds [9, 10]. The solvation free energies in water and in 1-octanol are recovered from the NE work distributions using the Jarzynski and the Crooks fluctuation theorems [11, 12].

Calculations were done using three popular general force fields for drug-like molecules, namely GAFF2 [13, 14], CGenFF [15] and OPLS-AA [16, 17]. For each force field, two blind predictions were uploaded: (i) a computationally expensive “challenge” submission, done with a precise [18, 19] bidirectional approach based on the Bennett Acceptance Ratio [20] (BAR) estimator; (ii) a less precise and faster submission, obtained with the fast-growth unidirectional method exploiting the Crooks theorem for normal work distributions [21, 22, 23, 24]. We will try to put our contribution to the SAMPL6 initiative into the context of the other MD-based submissions. These were done, in the vast majority of cases, adopting the CGenFF, OPLS-AA or GAFF force fields, or refined variants of them, with equilibrium methodologies relying on alchemical stratification [25, 26] and exploiting the thermodynamic integration [27] (TI) or the free energy perturbation [28] (FEP) methods.

The paper is organized as follows. In Section "Alchemical transformations: theoretical background", we succinctly introduce the equilibrium and nonequilibrium approaches in the context of the MD-based physical methods. In Section "Materials and methods in NES submissions", we provide the detailed methodological information concerning the NES submissions, including force field parametrization, atomistic description of the water and 1-octanol solvents, and simulation parameters for the HREX stage and for the subsequent NE computational task. In Section "Overview on MD-based SAMPL6 submissions", a global assessment of the MD-based submissions in LogP/SAMPL6 is presented, highlighting common patterns and discrepancies, as well as force field-related critical issues. In Section "NES results", the results obtained using the NES methodologies, in the two variants of the fast switching growth/annihilation method (NES-2) and of the fast switching growth method (NES-1) and for each of the three general force fields, are critically discussed and compared to the related equilibrium FEP or TI blind predictions. Conclusions and perspectives are sketched out in the last section.

Alchemical transformations: theoretical background

Virtually all MD-based submissions in the challenge evaluated the solvation free energies in Eq. 1 by way of the so-called alchemical approach, whereby the solute-solvent interaction is gradually turned off or on using a decoupling/recoupling intermolecular alchemical parameter \(\lambda\) such that \(V_{\lambda } = V_\mathrm{solv} + V_\mathrm{solute} + \lambda V_\mathrm{solute-solvent}\). At \(\lambda =1\) the molecule is fully solvated; at \(\lambda =0\) the molecule is totally decoupled from the solvent, acting as if it were in the gas phase. Such a methodology can be implemented in the context of equilibrium simulations using the stratification strategy or multistage sampling or, in the context of nonequilibrium thermodynamics, by producing many concurrent and independent fast switching trajectories. To make the paper self-contained, here we briefly sketch the theoretical background of the two approaches, referring the reader to excellent recent reviews [19, 26, 29] for a more complete treatment of the subject.

In the equilibrium techniques, the system is simulated at constant pressure and temperature in an appropriate number n of intermediate states corresponding to values of the \(\lambda\) coupling parameter between 0 and 1. The Gibbs solvation free energy is recovered in an inexpensive post-processing stage by summing up contributions obtained by applying the FEP Zwanzig formula [28] for each of the contiguous \(\lambda\) states, i.e.

$$\begin{aligned} \varDelta G_\mathrm{solv} = -\beta ^{-1} \sum _i \ln \langle e^{-\beta (V_{\lambda _{i+1}}-V_{\lambda _{i}})} \rangle _{\lambda _i} \end{aligned}$$
(2)

where \(\beta = 1/(k_B T)\) is the inverse thermal energy, with \(k_B\) the Boltzmann constant and T the thermodynamic temperature, and \(\langle \cdot \rangle _{\lambda _i}\) denotes the isothermal-isobaric average taken with potential energy \(V_{\lambda _i}\). Precision can be increased at a very limited cost by storing during the simulations the potential energy evaluated at both the \(\lambda _{i+1}\) and the \(\lambda _{i-1}\) values, so that the free energy can be recovered as a sum of BAR contributions [19, 26, 29, 30]. Alternatively, and equivalently, the \(\lambda\)-derivatives of the potential energy can be stored in the stratification, recovering the solvation free energy via numerical thermodynamic integration [27]:

$$\begin{aligned} \varDelta G_\mathrm{solv} = \sum _i \left\langle \partial V_\lambda /\partial \lambda \right\rangle _{\lambda _i} \varDelta \lambda \end{aligned}$$
(3)
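The two stratified estimators can be sketched in a few lines of Python. The sketch assumes that, for each stratum, the energy differences \(V_{\lambda _{i+1}}-V_{\lambda _i}\) have been evaluated on configurations sampled at \(\lambda _i\) (the usual Zwanzig convention), and it uses the trapezoidal rule as one possible quadrature for the TI sum. Function names and input layout are hypothetical and are not taken from any of the submitted workflows.

```python
import math

def fep_zwanzig(delta_v_per_stratum, beta):
    """Sum of Zwanzig contributions over contiguous lambda strata.

    delta_v_per_stratum[i] holds samples of V(lambda_{i+1}) - V(lambda_i)
    evaluated on configurations generated at lambda_i.
    """
    total = 0.0
    for samples in delta_v_per_stratum:
        shift = min(samples)  # factor out the smallest value for stability
        avg = sum(math.exp(-beta * (dv - shift)) for dv in samples) / len(samples)
        total += shift - math.log(avg) / beta
    return total

def ti_trapezoid(lambdas, mean_dv_dlambda):
    """Thermodynamic integration of <dV/dlambda> with the trapezoidal rule."""
    total = 0.0
    for k in range(len(lambdas) - 1):
        width = lambdas[k + 1] - lambdas[k]
        total += 0.5 * (mean_dv_dlambda[k] + mean_dv_dlambda[k + 1]) * width
    return total
```

In both cases the post-processing is inexpensive; the cost lies in converging the averages on every stratum.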

It is important to stress that in FEP or TI approaches the simulation must be well converged on each of the n alchemical \(\lambda\) strata, with a canonical sampling of the relevant solute and solvent conformational space. The convergence rate for a given stratum is an unknown function of the corresponding \(\lambda\) coupling parameter and can vary substantially in the range [0,1]. Typically, barriers between conformational states become higher at low coupling (i.e. when \(\lambda \rightarrow 0\)) due to the lack of the screening effect of the solvent on the intramolecular electrostatic interactions [31], making the convergence of the simulation harder for weakly coupled solutes. Also, in setting up the FEP or TI simulations, due care must be taken in “choosing the alchemical protocol so that the total uncertainty for the transformation is the one which has an equal contribution to the uncertainty across every point along the alchemical path” [32], or equivalently so that the overlap between contiguous potential energy distributions is significant and approximately constant in the whole range [0,1], a task that would require prior knowledge of the dependence of \(\varDelta G\) on \(\lambda\) [26].

In the nonequilibrium approach, equilibrium sampling in the isothermal-isobaric ensemble is required only for the end-states, which can be effectively implemented using specialized and highly efficient HREX enhanced sampling schemes [33]. Starting from a representative HREX sample of n phase-space points (from a few tens to a few hundred) of a given end-state, the system is rapidly driven to the other end-state by continuously varying the \(\lambda\) alchemical parameter in a swarm of n corresponding concurrent and independent NE trajectories, each typically lasting a time \(\tau\) from a few tens to a few hundred picoseconds and eventually producing an alchemical NE work computed as \(W=\int _0^{\tau } \frac{\partial V_\lambda }{\partial \lambda }\dot{\lambda }dt\). The fast switching stage can be straightforwardly implemented as a single embarrassingly parallel job on modern HPC platforms, allowing the computation of the NE work distribution in a matter of wall-time minutes.

The NES process can be conducted in both directions, with the \(\lambda : 0 \rightarrow 1\) and \(\lambda : 1 \rightarrow 0\) processes conventionally indicated [26] as the forward growth and the reverse annihilation of the solute, respectively. The growth and annihilation NES simulations produce two independent unidirectional estimates of the solvation free energy, namely

$$\begin{aligned} \varDelta G= & {} -\beta^{-1} \ln \left ( \frac{1}{n} \sum _{i=1}^n e^{-\beta W_i(G)} \right ) \end{aligned}$$
(4)
$$\begin{aligned} \varDelta G= & {} \beta^{-1} \ln \left ( \frac{1}{n} \sum _{i=1}^n e^{-\beta W_i(A)} \right ) \end{aligned}$$
(5)
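The exponential averages in Eqs. 4 and 5 are dominated by the tail of low work values, so in practice they are best evaluated in logarithmic form. A minimal Python sketch is given below; the function names are hypothetical, and the annihilation work \(W_i(A)\) is assumed to be the work done on the system along the annihilation process.

```python
import math

def log_mean_exp(values):
    """Numerically stable ln( (1/n) * sum_i exp(values[i]) )."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def jarzynski_growth(work_growth, beta):
    """Eq. 4: solvation free energy from the growth works W_i(G)."""
    return -log_mean_exp([-beta * w for w in work_growth]) / beta

def jarzynski_annihilation(work_annihilation, beta):
    """Eq. 5: solvation free energy from the annihilation works W_i(A)."""
    return log_mean_exp([-beta * w for w in work_annihilation]) / beta
```

By Jensen's inequality, the growth estimate never exceeds the mean growth work, the difference being the estimated dissipation.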

The above unidirectional estimates based on the Jarzynski exponential average, while asymptotically exact, are nonetheless affected by a bias error that decreases as the number of work values, n, grows and that increases with the dissipation [34], defined as the difference between the mean NE work and the underlying free energy. Provided that the fast growth and annihilation transformations are conducted with inverted time schedules, the two estimates of Eqs. 4 and 5 can be combined in a bidirectional, statistically efficient and unbiased [18] BAR estimator where \(\varDelta G\) is given by the root of the equation

$$\begin{aligned} \sum _{i=1}^{n} \frac{1}{ 1 + \mathrm{e}^{ \beta ( W_i(G) - \varDelta G ) } } - \sum _{i=1}^{n} \frac{1}{ 1 + \mathrm{e}^{ \beta ( W_i(A) + \varDelta G ) } } = 0 \end{aligned}$$
(6)
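Since the left-hand side of Eq. 6 increases monotonically with \(\varDelta G\), the root can be located with a simple bisection. The following Python sketch is one possible implementation under the assumption of equal numbers of growth and annihilation works; the function names and the bracketing strategy are choices made for the illustration and are not the ORAC implementation.

```python
import math

def fermi(x):
    """1 / (1 + exp(x)), written so that exp never overflows."""
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))

def bar_estimate(work_growth, work_annihilation, beta, tol=1e-10):
    """Root of Eq. 6 by bisection (equal sample sizes assumed)."""
    if len(work_growth) != len(work_annihilation):
        raise ValueError("Eq. 6 as written assumes equal sample sizes")

    def residual(dg):
        fwd = sum(fermi(beta * (w - dg)) for w in work_growth)
        rev = sum(fermi(beta * (w + dg)) for w in work_annihilation)
        return fwd - rev

    # The root lies between the extreme values of W(G) and -W(A)
    candidates = list(work_growth) + [-w for w in work_annihilation]
    lo, hi = min(candidates) - 1.0, max(candidates) + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, growth works of -4 and -3 combined with annihilation works of 6 and 7 (same units, same dissipation in the two directions) give a root at -5.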

If, e.g., the growth work distribution \(P_G(W)\) is found to be normal then, as a trivial consequence of the Crooks theorem [35], the annihilation distribution, \(P_A(-W)\), obtained with the inverted time schedule, must be normal too with the same variance, \(\sigma _A=\sigma _G\). The forward and reverse distributions are symmetrical with respect to the crossing point at \(W=\varDelta G\). For normal distributions, the free energy can hence be recovered with the unbiased unidirectional estimators:

$$\begin{aligned} \varDelta G= & {} \langle W_G \rangle - \frac{1}{2} \beta \sigma ^2 \end{aligned}$$
(7)
$$\begin{aligned} \varDelta G= & {} -\langle W_A \rangle + \frac{1}{2} \beta \sigma ^2 \end{aligned}$$
(8)

where \(\langle W_{G/A} \rangle\) are the mean values of the growth work and of the annihilation work with inverted sign. The quantity \(\frac{1}{2} \beta \sigma _G^2 = \frac{1}{2} \beta \sigma _A^2=W_\mathrm{diss}\) is the dissipated work in the transformations, which must be identical in either direction. Provided that the sampling of the starting end-states has been adequate, the confidence interval (e.g. at the 95% level) of the two independent NES Gaussian estimates in Eqs. 7 and 8 narrows as the inverse square root of n and depends only on the sample variance \(\sigma ^2\) [23, 24, 26, 36]. The normality of the distributions can be instantly checked [24] using standard procedures such as the Kolmogorov-Smirnov, the Shapiro-Wilk, the Jarque-Bera or the Anderson-Darling tests [37].
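The Gaussian estimators of Eqs. 7 and 8 require only the first two moments of the work sample, as the Python sketch below illustrates. Here \(W_A\) is assumed to be the work done on the system along the annihilation process, and the confidence interval uses the textbook variances of the mean and of the sample variance of a normal sample; this error formula is an assumption of the sketch, not an expression quoted from the cited references.

```python
import math
import statistics

def gaussian_growth(work_growth, beta):
    """Eq. 7: dG = <W_G> - beta * sigma^2 / 2."""
    return statistics.mean(work_growth) - 0.5 * beta * statistics.variance(work_growth)

def gaussian_annihilation(work_annihilation, beta):
    """Eq. 8: dG = -<W_A> + beta * sigma^2 / 2."""
    return (-statistics.mean(work_annihilation)
            + 0.5 * beta * statistics.variance(work_annihilation))

def gaussian_ci95(work, beta):
    """Half-width of the 95% interval for a normal work sample.

    Var(mean) = sigma^2 / n and Var(s^2) = 2 sigma^4 / (n - 1), the two
    being independent for normally distributed data.
    """
    n = len(work)
    var = statistics.variance(work)
    return 1.96 * math.sqrt(var / n + beta ** 2 * var ** 2 / (2.0 * (n - 1)))
```

Both terms under the square root scale as 1/n, so quadrupling the number of NE trajectories roughly halves the error bar.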

Materials and methods in NES submissions

System preparation

In order to assess the performance of the “official” versions of the most popular force fields (FF), we submitted multiple NES predictions adopting: the CGenFF parameter sets as obtained from the web interface “paramchem” [38, 39]; the GAFF2 topological and parameter files as obtained from the web interface “PrimaDORAC” [14]; the OPLS-AA parameter sets as obtained from the web interface “LigParGen” [17]. The GAFF2 atomic charges are computed by PrimaDORAC at the AM1/BCC level. For the OPLS-AA charges, the LigParGen option “1.14*CM1A-LBCC” was used. The charges in CGenFF are assigned by analogy [39] by the paramchem web toolkit. All calculations were done with the program ORAC [40]. For each of the three FF’s, the parameter files were converted “as is” to the ORAC format with no further adjustment. The ORAC suite, including source code and documentation, can be freely downloaded from the website www.chim.unifi.it/orac.

For all submissions, solvation free energies were evaluated by dissolving the solutes in 1240 water molecules or 125 molecules of 1-octanol in a cubic MD box. Hence, in all cases we used “dry” 1-octanol as a solvent. The explicit water solvent in the hydration free energy calculations was described using the recently developed OPC3 [41] three-point site model. For 1-octanol as a solvent, in the force field specific submissions, the parametrizations provided by the paramchem (CGenFF), PrimaDORAC (GAFF2) and LigParGen (OPLS-AA) web toolkits were adopted. The use of the common OPC3 model in all our NES submissions is motivated by the fact that OPC3 accurately reproduces both the static dielectric constant and the density of water at standard conditions. Besides, in Ref. [51] we showed that, while the effect of the selected force field for the solute molecule is important for the LogP prediction, the choice of the force field for the solvent water molecule is much less critical, with barely detectable differences in the hydration energies when switching water models.

All simulations were done in the NPT isothermal-isobaric ensemble under periodic boundary conditions, yielding a mean side-length of around 32–33 Å in both water and 1-octanol in all cases. The external pressure was set to 1 atm using a Parrinello-Rahman Lagrangian [42] with isotropic stress tensor. The temperature was held constant at 298 K using three Nosé-Hoover thermostats coupled to the translational degrees of freedom of the systems and to the rotational/internal motions of the solute and of the solvent. The equations of motion were integrated using a multiple time-step r-RESPA scheme [43] with a potential subdivision specifically tuned for bio-molecular systems in the NPT ensemble [42, 44]. The long range cut-off for Lennard-Jones interactions was set to 13 Å. Long range electrostatics were treated using the Smooth Particle Mesh Ewald method [45], with an \(\alpha\) parameter of 0.38 Å\(^{-1}\), a grid spacing in the direct lattice of about 1 Å and a fourth order B-spline interpolation for the gridded charge array.

In Table 1, we show the computed density, \(\rho\), and static dielectric constant, \(\epsilon\), of pure 1-octanol using the three FF’s compared to the experimental values [46]. The average density and dielectric constant were calculated on 125 molecules of 1-octanol in the NPT ensemble at T = 298 K and P = 1 atm, running for 12 ns. The three force fields yield essentially the same \(\epsilon\) and \(\rho\) values and are in acceptable agreement with the experimental counterpart.

Table 1 Computed static dielectric constant and density (g/cm\(^3\)) of pure 1-octanol in standard conditions using various force fields. Experimental values are taken from Ref. [46]

The density and dielectric constant of OPC3 water in standard conditions are 0.996 ± 0.01 g/cm\(^{3}\) and 78 ± 4, respectively [41]. The corresponding experimental values are 0.997 g/cm\(^{3}\) and 79 [46].

End-states HREX simulations

The Hamiltonian Replica Exchange simulations in solution (water and 1-octanol) and in the gas phase for each of the 11 solute molecules were done using torsional tempering. Torsional tempering, a specialized solute tempering [47] scheme described in detail in Ref. [33], makes it possible to selectively enhance the sampling of the relevant degrees of freedom of the system while keeping the number of replicas to a minimum. For the compounds of Fig. 1, the scaling involves all the torsional potentials (including 1–4 non-bonded interactions) around the rotatable bonds connecting the planar rigid units, using a minimum scaling factor of \(c=0.1\), corresponding to a “torsional temperature” of about 3000 K. Only the scaling factors are communicated among replicas, minimizing inter-processor communications. The torsional generalized ensemble (GE) space is covered using only four replicas, with the scale factors [33] given by \(c_m=c^{(m-1)/3}\), \(m=1,\ldots ,4\). Each system at the end-states \(\lambda =0\) and \(\lambda =1\) was simulated using HREX for 8 ns in the target state (hence 32 ns in total for each solute molecule), saving 420 phase space points, one every 20 ps, for the later NES stage. Further technical details on the HREX simulations (round trip times, torsional energy overlap, GE state distributions of the walkers) are provided in Ref. [31] for specific examples.
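The replica ladder implied by \(c_m=c^{(m-1)/3}\) can be tabulated with the short Python sketch below. Interpreting the “torsional temperature” as the simulation temperature divided by the scaling factor is an assumption of the example (it reproduces the quoted value of about 3000 K at 298 K); the function names are hypothetical.

```python
def scale_factors(c_min=0.1, n_replicas=4):
    """c_m = c_min ** ((m - 1) / (n_replicas - 1)) for m = 1..n_replicas."""
    return [c_min ** ((m - 1) / (n_replicas - 1)) for m in range(1, n_replicas + 1)]

def torsional_temperature(scale, temperature=298.0):
    """Effective temperature of the scaled torsional degrees of freedom."""
    return temperature / scale

for c in scale_factors():
    print(f"c = {c:.4f}  T_tors = {torsional_temperature(c):.0f} K")
```

The geometric progression keeps the ratio between neighbouring scaling factors constant, a common choice for obtaining comparable exchange rates along the ladder.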

NES stage

In the NES stage, for each compound of Fig. 1, 420 annihilation trajectories were started in a single parallel job in the isothermal-isobaric ensemble in standard conditions, reading the initial phase points harvested in the preceding HREX at full coupling, \(\lambda =1\). In each of these NE annihilation trajectories, the solute was decoupled down to \(\lambda =0\) in 150 ps in water and in 300 ps in 1-octanol. The annihilation times were chosen such that the mean dissipated work is approximately the same in the two solvents. In detail, the decoupling protocol involves the linear discharging of the solute in the first 30 ps (water) and 60 ps (1-octanol), followed by the Lennard-Jones decoupling in the remaining 120 ps (water) and 240 ps (1-octanol). For the Lennard-Jones decoupling, we used a soft-core Beutler potential [48] regularization as \(\lambda \rightarrow 0\). This NES protocol was chosen on the basis of past experience with NES solvation free energy calculations for molecules of comparable size [10, 49, 50, 51]. For each of the 11 compounds, 4 work histograms were produced, i.e. \(P_G(W), P_A(-W)\) in water and in 1-octanol. All 44 work histograms for the three FF’s, produced in the NES stage, are reported in Figures S1–S3 of the Supporting Information (SI).

The fast-growth NES stage was started combining 420 HREX phase space points of the solute in the gas-phase with equilibrated and decorrelated snapshots of the pure solvent, that is, inserting the solute as a ghost molecule in a random position and with random orientation in the equilibrated solvent configurations. The fast growth NES protocol corresponds to the inverted time schedule of the annihilation stage, i.e. in water and in 1-octanol the solute-solvent Lennard-Jones potential with soft-core regularization is first switched on in 120 and 240 ps, followed by the recharging process up to 150 and 300 ps, respectively.
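The two time schedules can be summarized by the Python sketch below, which returns the electrostatic and Lennard-Jones coupling at a given time of the NE run. Representing the protocol with two separate coupling parameters and taking the Lennard-Jones ramp to be linear are assumptions made for the illustration; only the linear discharging and the stage durations are stated above.

```python
def annihilation_schedule(t, t_charge, t_total):
    """Return (lambda_charge, lambda_lj) at time t (ps) of an annihilation run."""
    if t <= t_charge:
        # Stage 1: linear discharging at full Lennard-Jones coupling
        return 1.0 - t / t_charge, 1.0
    # Stage 2: Lennard-Jones decoupling of the discharged solute
    return 0.0, 1.0 - (t - t_charge) / (t_total - t_charge)

def growth_schedule(t, t_charge, t_total):
    """Growth is the time-reversed annihilation schedule."""
    return annihilation_schedule(t_total - t, t_charge, t_total)

# Water: 30 ps of discharging within a 150 ps run; 1-octanol: 60 ps within 300 ps
print(annihilation_schedule(90.0, 30.0, 150.0))
print(growth_schedule(240.0, 60.0, 300.0))
```

With these definitions the growth run switches on the Lennard-Jones interaction during the first 120 ps (water) or 240 ps (1-octanol) and recharges the solute in the remaining time.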

LogP estimates

For each of the three FF’s, we submitted two blind predictions. The “challenge” prediction (NES-2) was done using both the growth and annihilation NE work values, exploiting the BAR bidirectional estimate, Eq. 6. The error was computed using bootstrap with resampling and corresponds to 1.96 times the square root of the \(\varDelta G\) variance (i.e. \(\alpha =0.05\), 95% confidence interval). In the second “fast” prediction (NES-1), the LogP was determined using only the fast-growth unidirectional estimates for the solvation free energies. The forward work distribution, \(P_G(W)\), was checked for normality according to the Anderson-Darling (AD) test defined via the quantity \(A^2=-n-\sum _{i=1}^n \frac{2i-1}{n} \left[ \ln \varPhi (w_i) + \ln \left( 1-\varPhi (w_{n+1-i}) \right) \right]\), where \(\varPhi\) is the Gaussian cumulative distribution function with the sample mean and variance and \(w_i\) are the work values sorted in ascending order. The critical value of \(A^2\) at the level \(\alpha =0.05\) is 0.752 [52]. If \(A^2 \le 0.752\), the work distributions were assumed to be normal and the solvation free energies were computed using the unbiased unidirectional Gaussian estimate, Eq. 7. In case of AD failure, the unidirectional Jarzynski estimate, Eq. 4, was used. As for the first blind prediction, the errors on the solvation free energies for the forward (growth) unidirectional estimates were also evaluated using bootstrap with resampling. Raw results for all six blind predictions are reported in the SI (Tables S1–S6).
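The NES-1 decision rule can be sketched in Python as follows: the AD statistic is computed on the growth works, the Gaussian or the Jarzynski estimator is selected accordingly, and the error is obtained by bootstrapping the selected estimator. Selecting the estimator once on the full sample before resampling, the number of bootstrap replicas and all function names are assumptions of this sketch and do not necessarily reflect the scripts used for the submissions.

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    p = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite

def anderson_darling(work):
    """A^2 against a Gaussian with the sample mean and standard deviation."""
    w = sorted(work)
    n = len(w)
    mu, sigma = statistics.mean(w), statistics.stdev(w)
    total = 0.0
    for i in range(1, n + 1):
        total += (2 * i - 1) * (math.log(normal_cdf(w[i - 1], mu, sigma))
                                + math.log(1.0 - normal_cdf(w[n - i], mu, sigma)))
    return -n - total / n

def gaussian_estimate(work, beta):
    return statistics.mean(work) - 0.5 * beta * statistics.variance(work)

def jarzynski_estimate(work, beta):
    top = max(-beta * w for w in work)
    mean_exp = sum(math.exp(-beta * w - top) for w in work) / len(work)
    return -(top + math.log(mean_exp)) / beta

def nes1_estimate(work, beta, a2_critical=0.752, n_boot=200, seed=0):
    """Return (free energy, 95% bootstrap error, name of the estimator used)."""
    normal = anderson_darling(work) <= a2_critical
    estimator = gaussian_estimate if normal else jarzynski_estimate
    rng = random.Random(seed)
    boot = [estimator(rng.choices(work, k=len(work)), beta) for _ in range(n_boot)]
    return estimator(work, beta), 1.96 * statistics.stdev(boot), estimator.__name__
```

The same bootstrap loop applies to the bidirectional case by resampling the growth and annihilation works independently and recomputing the BAR root for each replica.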

Efficiency considerations

The “challenge” BAR-based blind prediction, on a per solute molecule basis, required a total of 424 ns for a system of approximately 3000 atoms (64 ns for the HREX stage and 360 ns for the forward and reverse NES stages). The fast-growth blind prediction, on a per solute basis, required a total of 180 ns (with a negligible cost of the HREX on the isolated molecule) for a system of approximately 3000 atoms. All computations were done with the OpenMP/MPI hybrid version of the ORAC program [40] on the 24K-core CRESCO6 ENEA cluster, equipped with Intel Skylake 48-core 2.4 GHz CPUs. The “challenge” predictions were computed by submitting four batch parallel job scripts, each processing all 11 compounds sequentially, namely two HREX simulations (water and 1-octanol) at full coupling and two subsequent NES (water and 1-octanol), and were completed in two wall-clock days. The fast-growth predictions were obtained by submitting just one batch parallel job script (computing sequentially the fast-growth in water and in 1-octanol for all compounds) and were completed in a few wall-clock hours. Examples of these batch submission scripts are reported in the SI.

Overview on MD-based SAMPL6 submissions

A total of 31 MD-based submissions were uploaded to the SAMPL6/LogP challenge. Of these, 30 were performed using the alchemical approach; 24 submissions used the FEP or TI equilibrium technique, and 6 the NES method. No other group, except for ourselves, used the NES method for LogP calculations.

Table 2 Overall performances (in bold font) of the MD-based alchemical approach for the GAFF, CGenFF and OPLS-AA parametrizations

In 8 submissions (6 FEP, 2 NES), CGenFF was used. In some cases, refined versions of the same force field were used, processing the paramchem parameters with the “lsfitpar” refinement program [53]. In 9 submissions (6 FEP, 1 TI, 2 NES), the GAFF1 or GAFF2 force field was used, with atomic charges computed at the AM1/BCC level or at a higher QM level. In 6 submissions (3 TI, 1 FEP, 2 NES), the OPLS-AA LigParGen parameter sets were used. The latter six submissions used the very same LigParGen generated force field setup for the solute molecule and hence provide a subset for comparing the performances of equilibrium and nonequilibrium methodologies. In one case (submission nh6c0, one of the best performing MD-based blind predictions), the original GAFF1 FF parameters were manually adjusted. The remaining MD-based submissions used polarizable force fields. Water was described using the OPC3 model (only in the 6 NES submissions), the TIP3P model [54] (4 submissions), the TIP4P model [54] (4 submissions) and the SPC/E model [55] (2 submissions). In the remaining MD-based instances the water model was not specified. In all submissions using classical non-polarizable FF’s, 1-octanol was modeled using the reference force field. Dry and wet 1-octanol were used in 15 and 8 submissions, respectively. The FEP or TI protocols used a number of \(\lambda\) windows from a minimum of 12 to a maximum of 20, and each \(\lambda\) state was simulated from a few ns in water up to 20 ns in 1-octanol. Details of all MD-based submissions using non-polarizable force fields are reported in Table S7 of the SI.

Fig. 2 Experimental-calculated LogP correlation plots for MD-based submissions for the CGenFF, GAFF and OPLS-AA force fields. NES-2 and NES-1 indicate the bidirectional and unidirectional NES submissions, respectively

In Table 2 and in Fig. 2 we show the overall performances of the alchemical methods as a function of the adopted FF (whether refined or not). Except for the submissions using polarizable force fields (not reported in Fig. 2), which show consistently (and surprisingly) poor results, the performances of the three fixed-charge FF’s reported in Table 2 are somewhat contradictory, depending on the selected quality indicator. CGenFF appears to be the best performing FF in terms of the mean unsigned error (MUE), but has the lowest Pearson and Kendall coefficients. Surprisingly, among the less accurate CGenFF submissions, some (see Table S7 in the SI for details) were done using a force field refined because of a positive “penalty score” obtained from the paramchem toolkit. This is somewhat puzzling given that, apparently, the “refined” CGenFF submissions were done using exactly the same FEP protocol as the best performing standard (paramchem) CGenFF predictions.
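For reference, the quality indicators used in this comparison can be computed as in the Python sketch below. The Kendall coefficient is implemented in its simplest tau-a form, with tied pairs counted as zero; whether the official SAMPL6 analysis used this or another variant is not assumed here, and the function names are hypothetical.

```python
import math

def mue(pred, ref):
    """Mean unsigned error."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

def rmse(pred, ref):
    """Root mean square error."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def pearson_r(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)
```

A prediction set that is uniformly shifted with respect to experiment has a non-zero MUE but perfect Pearson and Kendall coefficients, which is how the error and correlation indicators can rank the same force field differently.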

At variance with CGenFF, GAFF yields larger MUE's, consistently overestimating the LogP for virtually all uploaded submissions (see Fig. 2, central panel). On the other hand, GAFF outperforms CGenFF for both the Pearson and the Kendall rank coefficients, with no striking degradation or improvement of performance when the AM1-BCC charges are replaced by high-level QM atomic charges (see Table S7 in the SI for details). The OPLS-AA FF lies somewhat in between the CGenFF and GAFF FF’s. The systematic overestimation of the LogP’s, although still evident (see Fig. 2), is less pronounced than with GAFF, as measured by a somewhat smaller overall MUE. All six OPLS-AA submissions, on the other hand, exhibit good overall correlation and ranking coefficients. The LogP overestimation in GAFF and OPLS-AA is likely due to the fact that the balance between Lennard-Jones and electrostatic interactions in the parametrization has in general been trained on hydration free energies of small molecules [13, 16]. Thus, the electrostatic contribution to the solvation energy could be somewhat overestimated when using water-trained atomic charges in the apolar 1-octanol solvent, eventually producing systematically higher LogP values.

Overall, as Table S7 in the SI shows, the use of wet 1-octanol did not appear to appreciably improve the performances. Both the FF and the reported methodological errors (ranging from 0.1 to 2 LogP units) are seemingly larger than the differences obtained in specific cases for wet or dry 1-octanol. For example, for OPLS-AA, the best performing submission was obtained using dry 1-octanol, as for CGenFF. For GAFF, the submission bearing the lowest MUE and the highest R and Kendall coefficients, done using wet 1-octanol, is only marginally better than other submissions obtained using dry 1-octanol. This is because in wet 1-octanol, in spite of the high molar water fraction of 0.27, one molecule of water per four molecules of 1-octanol translates into a water/octanol volume fraction of only \(\simeq 0.03\), with a limited impact (a slight decrease) on the static dielectric constant [56].
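The order of magnitude of the volume fraction can be checked with the short Python sketch below. It assumes ideal mixing and uses approximate room-temperature molar volumes of about 18 cm\(^3\)/mol for water and about 158 cm\(^3\)/mol for 1-octanol; these values and the function name are assumptions introduced for the example.

```python
def water_volume_fraction(n_water, n_octanol, v_water=18.07, v_octanol=158.0):
    """Ideal-mixing volume fraction of water; molar volumes in cm^3/mol."""
    water = n_water * v_water
    return water / (water + n_octanol * v_octanol)

# One water molecule per four 1-octanol molecules
print(round(water_volume_fraction(1, 4), 3))
```

Because one 1-octanol molecule occupies almost nine times the volume of a water molecule, even a sizeable number fraction of water corresponds to only a few percent of the total volume.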

Further details of the MD-based submissions are shown in Fig. 3, where we report the statistics of all MD-based prediction sets (classified according to the FF) in the SAMPL6 challenge for each of the compounds of Fig. 1. The experimental values and the NES-2 and NES-1 submissions are indicated as green circles, red filled triangles and red striped squares, respectively. For CGenFF, NES predictions exhibit the largest deviation for SM02, as do most of the other submissions. For all other compounds, NES is consistently among the best performing methods. In the case of GAFF, as previously stated, all prediction sets appear to overestimate the LogP. NES submissions are found in most cases in correspondence with the maximum of the overall SAMPL6 GAFF distribution. For OPLS-AA, the overestimation of LogP is still evident and, again, NES submissions are systematically found among the best performing alchemical methods.

Fig. 3 Detailed NES statistics of MD-based submissions in the SAMPL6 challenge for all compounds, classified according to the adopted FF. Green circles: experimental data. Red triangles: NES-2 (bidirectional) submission. Red striped squares: NES-1 (unidirectional) submission

NES results

The performances of the BAR-based bidirectional NES “challenge” submissions, NES-2, shown in parentheses in Table 2 and as black circles in Fig. 2, are consistently among the highest ranking for each of the FF’s. Only for the CGenFF submission (see Table S7 in the SI) did the unidirectional estimate NES-1 turn out to be marginally better than the bidirectional estimate NES-2, very likely reflecting just a lucky “shot on goal”.

At variance with FEP or TI submissions, all NES submissions are moderately to strongly mutually correlated, irrespective of the adopted force field or of the kind of estimate, unidirectional or bidirectional. In Table 3, we show the correlation matrix obtained with NES for the three force fields. In the upper and lower triangles we report the R and \(\tau\) correlation coefficients obtained with NES-2 and NES-1, respectively.

Table 3 NES correlation matrix; Upper triangle NES-2 correlation; Lower triangle NES-1 correlation

As can be seen from Table 3, for the three NES-2 submissions the mutual correlation R between FF’s ranges from a minimum of 0.34 (CGenFF-OPLS-AA and CGenFF-GAFF2) to a maximum of 0.94 (OPLS-AA-GAFF2). The mutual Kendall rank coefficient \(\tau\) behaves similarly. Mutual correlations do not vary significantly among the three NES-1 unidirectional estimates, showing similar R and \(\tau\) indices among the various FF’s.
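For readers who wish to build a correlation matrix of this kind from their own prediction sets, the following is a minimal sketch of the pairwise Pearson R and Kendall \(\tau\) calculation. The force-field labels and LogP values are placeholders for illustration only and are not the SAMPL6 predictions; the \(\tau\) variant shown (tau-a, no tie correction) is an assumption, since the text does not specify which variant was used.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau(x, y):
    """Kendall rank coefficient (tau-a: tied pairs contribute zero)."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(list(zip(x, y)), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def correlation_matrix(predictions):
    """Return {(ff_a, ff_b): (R, tau)} for every pair of prediction sets."""
    return {
        (a, b): (pearson_r(predictions[a], predictions[b]),
                 kendall_tau(predictions[a], predictions[b]))
        for a, b in combinations(predictions, 2)
    }


# Placeholder LogP values for illustration only; NOT the SAMPL6 predictions.
toy = {
    "FF_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "FF_B": [1.2, 1.9, 3.5, 3.9, 5.3],
    "FF_C": [2.0, 1.0, 4.0, 3.0, 5.0],
}
for pair, (r, tau) in correlation_matrix(toy).items():
    print(pair, f"R = {r:.2f}", f"tau = {tau:.2f}")
```

R measures linear agreement between two sets of predicted values, whereas \(\tau\) depends only on the ranking of the compounds, which is why the two indices are reported side by side.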

The NES-2 and NES-1 correlation, already evident from Fig. 3, can be further assessed in Fig. 4, where we compare the bidirectional “challenge” NES-2 and unidirectional fast-growth NES-1 predictions for all 11 compounds and for each FF. In the figure we also report the error on the LogP with the two estimates. As can be seen from the figure, the optimal NES-2 and the “fast” NES-1 estimates are strongly correlated in all three cases. In the case of CGenFF and GAFF, the NES-2 and NES-1 LogP are strikingly similar, yielding Pearson coefficients of 0.98 and 0.93, respectively. Correlation (R = 0.78) is only moderately degraded for OPLS-AA. This fact can be easily explained by inspecting Tables S4–S6 and Figures S1–S3 in the SI. The OPLS-AA work distributions failed the AD test in 10 cases in water or 1-octanol, compared to the three cases of CGenFF (SM08, SM11, SM15) and the two cases of GAFF2 (SM08, SM16). Hence, for the OPLS-AA FF, the less accurate and precise Jarzynski estimate, Eq. 4, was used for the majority of the compounds, at variance with GAFF2 and CGenFF, where the unbiased Gaussian estimate, Eq. 7, was mostly used.

Fig. 4 LogP coefficients computed using NES-2 and NES-1 approaches
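The choice between the Gaussian and the Jarzynski unidirectional estimates, driven by a normality test on the work distribution, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the textbook exponential-average and second-cumulant formulas are assumed to correspond to Eqs. 4 and 7, the 5% critical value (0.752) of the small-sample-corrected Anderson-Darling statistic is an assumed threshold, and the work samples are synthetic.

```python
import math
from statistics import NormalDist, fmean, variance

KT = 0.5925  # kcal/mol, R*T at 298.15 K (assumed working temperature)


def jarzynski_estimate(works, kt=KT):
    """Exponential average dG = -kT ln <exp(-W/kT)>, shifted for stability."""
    w_min = min(works)
    avg = fmean(math.exp(-(w - w_min) / kt) for w in works)
    return w_min - kt * math.log(avg)


def gaussian_estimate(works, kt=KT):
    """Second-cumulant estimate dG = <W> - var(W) / (2 kT)."""
    return fmean(works) - variance(works) / (2.0 * kt)


def anderson_darling_normal(works):
    """Corrected A^2 statistic for normality (mean and variance estimated)."""
    n = len(works)
    mu, sigma = fmean(works), math.sqrt(variance(works))
    std = NormalDist()
    z = sorted((w - mu) / sigma for w in works)
    # 1 - cdf(z) is evaluated as cdf(-z) to keep precision in the upper tail.
    s = sum(
        (2 * i + 1) * (math.log(std.cdf(z[i])) + math.log(std.cdf(-z[n - 1 - i])))
        for i in range(n)
    )
    a2 = -n - s / n
    return a2 * (1.0 + 0.75 / n + 2.25 / n**2)


def unidirectional_free_energy(works, kt=KT, critical=0.752):
    """Gaussian estimate if the AD test passes (assumed 5% level), else Jarzynski."""
    if anderson_darling_normal(works) < critical:
        return gaussian_estimate(works, kt), "gaussian"
    return jarzynski_estimate(works, kt), "jarzynski"


# Synthetic work samples (kcal/mol) built from quantiles; for illustration only.
n = 420
normal_like = [NormalDist(-10.0, 1.0).inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [-10.0 - math.log(1.0 - (i + 0.5) / n) for i in range(n)]
print(unidirectional_free_energy(normal_like))
print(unidirectional_free_energy(skewed))
```

The normal-like sample is routed to the Gaussian estimate and the strongly skewed one to the exponential average, whose value is dominated by the few lowest work values and is therefore less precise.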

As expected, the confidence intervals (reported as bar plots in Fig. 4) are larger for the unidirectional LogP estimates than for the bidirectional BAR-computed LogP. NES-2 in fact uses twice as many work values as NES-1. It is of note that the largest deviations between the NES-2 and NES-1 LogP estimates are in general seen in correspondence with large NES-1 errors, as in SM08 for CGenFF and OPLS-AA, and in SM16 for GAFF2. Again, large errors are due to the nature of the unidirectional estimates of the solvation free energies, obtained from the forward work histograms as assessed by the Anderson-Darling normality test (see Tables S4–S6 in the SI): if either \(\varDelta G_\mathrm{oct}\) or \(\varDelta G_\mathrm{wat}\) or both have been evaluated using the Jarzynski exponential average, Eq. 4, in lieu of the unbiased Gaussian estimate, Eq. 7, then the confidence interval increases markedly and the LogP becomes less accurate. We must stress that the computational cost of the NES-1 calculation of the LogP for one compound of the series is entirely due to the NES stage, as the cost of the HREX sampling for a single molecule is negligible. NES-1 is hence a matter of a few tens of wall-clock minutes on a Tier-1 HPC system such as the CRESCO6 cluster provided by ENEA [57]. In our case, the unidirectional water and 1-octanol solvation energies were computed (see technical details in the SI) running on about 3500 cores (about 1/8 of the CRESCO6 cluster) in less than an hour (15 min for water and 30 min for 1-octanol), yielding an average 95% confidence interval not exceeding one LogP unit. Accepting a confidence interval of 80% (i.e. running just 100 trajectories instead of 420), a dedicated CRESCO6 Tier-1 cluster can compute about 1000 NES-1 LogP coefficients per day.
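The throughput figure quoted above follows from simple bookkeeping in core-minutes. The sketch below assumes that the cost of a LogP calculation scales linearly with the number of NE trajectories and that independent jobs fill the whole cluster with no scheduling overhead; the total core count is inferred from the statement that 3500 cores are about 1/8 of the machine.

```python
def nes1_logp_per_day(
    cores_per_job: int = 3500,            # from the text
    minutes_per_compound: float = 45.0,   # 15 min (water) + 30 min (1-octanol)
    reference_trajectories: int = 420,
    trajectories: int = 100,
    total_cores: int = 8 * 3500,          # 3500 cores is about 1/8 of the cluster
) -> float:
    """Estimate the number of NES-1 LogP values per day on the full cluster.

    Assumes cost is linear in the number of NE trajectories and that the
    cluster is fully dedicated, with no queueing or I/O overhead.
    """
    cost = cores_per_job * minutes_per_compound * trajectories / reference_trajectories
    capacity = total_cores * 24.0 * 60.0  # core-minutes available per day
    return capacity / cost


print(round(nes1_logp_per_day()))
```

With these inputs the estimate is a little over 1000 LogP values per day, consistent with the order of magnitude stated in the text.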

We conclude this section by examining in detail the behavior of the FF’s for the compounds of the SAMPL6 challenge in terms of the electrostatic and Lennard-Jones contributions to the solvation free energy in water and in 1-octanol. The electrostatic contribution (QQ) was obtained from the reverse annihilation, evaluating the work distribution at the end of the discharging process (i.e. \(\tau =30\) ps and \(\tau =60\) ps for water and 1-octanol, respectively). We used the unidirectional estimate (Gaussian or Jarzynski, depending on the AD test) on the QQ work distributions. The Lennard-Jones contribution (LJ) was computed from the growth process, evaluating the work at \(\tau =120\) and \(\tau =240\) in water and 1-octanol, respectively. In this case all LJ work distributions, with no exception, were found to be normal and the Gaussian estimate was hence used in all cases. The results of this analysis are summarized in Fig. 5, where the values of the \(\varDelta G_{QQ}\) and \(\varDelta G_{LJ}\) contributions are reported for each of the compounds of Fig. 1. As expected, in water (see Fig. 5, top panel) the major contribution to the solvation energy is consistently due to the \(\varDelta G_{QQ}\) free energy. In 1-octanol (Fig. 5, bottom panel), the situation is reversed. For this solvent, the \(\varDelta G_{LJ}\) contribution is significantly larger than \(\varDelta G_{QQ}\) for almost all compounds, with the notable exception of SM08, which is the only compound bearing a carboxylate group and for which both CGenFF and GAFF2 predict a \(\varDelta G_{QQ}\) that is of the order of or larger than \(\varDelta G_{LJ}\).

Fig. 5 Electrostatic and Lennard-Jones contributions (see text for details) to the solvation in water (top panel) and in 1-octanol (bottom panel) for the series of SAMPL6 compounds

In spite of the evident correlation, the balance of the QQ and LJ contributions in the three FF’s exhibits significant differences. OPLS-AA and GAFF2 yield very similar LJ contributions in virtually all cases and in both solvents. OPLS-AA, however, has a larger QQ contribution than GAFF2 in both water and, to a somewhat lesser extent, 1-octanol. The QQ contribution in 1-octanol is probably overestimated, leading to the observed overestimation of the LogP coefficient in both OPLS-AA and GAFF2.

CGenFF, on the other hand, is characterized in many cases by smaller LJ contributions than the other two FF’s in both solvents. The CGenFF QQ contributions to the hydration free energy lie (except for SM02, SM14 and SM15) in between those obtained using the OPLS-AA and the GAFF2 force fields. These subtle differences apparently make CGenFF more balanced than OPLS-AA and GAFF2, yielding in many cases LogP values with significantly smaller MUE’s. We finally note that for the chlorinated compounds (SM04, SM12, SM16) the extra site representing the chlorine \(\sigma\)-hole [58] in CGenFF (and not used in OPLS-AA and GAFF2) does seem to have an appreciable impact on the corresponding MUE’s. The CGenFF LogP values for SM02, SM12 and SM16 are better than the corresponding OPLS-AA and GAFF2 values.

Conclusions

In this paper, we have given an overview of the MD-based blind predictions in the SAMPL6 LogP challenge using the most popular non-polarizable force fields, CGenFF, GAFF2 and OPLS-AA, in combination with the alchemical approach. Force fields produce in general moderately consistent predictions. No force field can be considered optimal, although CGenFF, in its standard (paramchem) implementation, appears to be more balanced and predictive than GAFF1 or GAFF2 and OPLS-AA, yielding in general smaller RMSE’s. On the other hand, OPLS-AA and GAFF2 show overall better Pearson and Kendall coefficients, both exhibiting a systematic overestimation of the LogP, possibly due to the overestimation of the electrostatic contribution to the solvation free energy in 1-octanol. Such a systematic error could possibly be rectified by rescaling the atomic charges of the solute in 1-octanol and/or the atomic charges of the 1-octanol molecules in the solvent. Results using polarizable force fields were surprisingly poor, showing that these force fields still need adjustment in the balance between the fixed atom-atom Lennard-Jones interaction and the Coulomb contributions due to the fluctuating atomic charges or dipoles.

NES emerges as a reliable tool for LogP prediction, systematically being among the top performing submissions in all force field classes for at least two of the various indicators (R, \(\tau\) or RMSE). In contrast to FEP or TI equilibrium approaches, which yield apparently very disparate results, all independent NES prediction sets, irrespective of the adopted force field and of the adopted estimate (unidirectional or bidirectional), are mutually moderately to strongly correlated (from 0.35 to 0.95). Remarkably, accuracy is only moderately degraded in the unidirectional (growth) NES-1 submissions, which are, on the other hand, extremely convenient from a computational standpoint: a single LogP can be computed in a matter of minutes on a Tier-1 HPC system such as the CRESCO6 ENEA cluster, equipped with Intel Skylake 2.4 GHz processors (48 cores). NES, at constant wall-time cost, provides a methodology that bypasses the sampling problem in MD-based equilibrium approaches. Poor convergence or inadequate solute conformational sampling along the alchemical coordinate could in fact be the primary cause of the observed disparity of FEP or TI submissions, even when using the same simulation setup and force field. Strikingly, the same disparity in equilibrium alchemical applications, even when using similar simulation protocols, is observed for the statistical uncertainties (imprecision) of the LogP.

At variance with stratification methods, which are based on equilibrium sampling on each of the strata, in the NES approach equilibrium is required only at the end-states. The phase-space sampling of the end-states can be acquired as accurately as possible by using highly parallel and efficient enhanced sampling techniques affording a fast and accurate canonical sampling of all relevant collective coordinates. In the subsequent NES stage, fast NE trajectories connect the equilibrium phase-space points of one end-state to the corresponding nonequilibrium set of the other end-state. Provided that the equilibrium sampling of the starting end-state has been adequate, the confidence interval in NES rigorously depends on the variance of a single work distribution, obtained by crossing the whole alchemical coordinate at fast speed, and on the number of NE trajectories (or equivalently, on the number of HREX-collected phase-space points). The confidence level in NES industrial projects can be controlled using only two parameters, that is (i) the number of NE trajectories and (ii) the length of the NE trajectories, which controls the final dissipation and variance.
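The dependence of the confidence interval on these two parameters can be illustrated with a small sketch. It assumes normally distributed, statistically independent work values and the second-cumulant (Gaussian) estimate, for which the variance of the free energy follows from the standard sampling variances of the mean and of the sample variance; the work standard deviations used in the example are hypothetical and serve only to show the scaling with the number of trajectories.

```python
import math
from statistics import NormalDist

RT = 0.5925  # kcal/mol, R*T at 298.15 K (assumed working temperature)


def gaussian_estimator_variance(sigma: float, n: int, kt: float = RT) -> float:
    """Variance of <W> - s^2/(2 kT) for n independent normal work values.

    Var(<W>) = sigma^2 / n and Var(s^2) = 2 sigma^4 / (n - 1).
    """
    return sigma**2 / n + sigma**4 / (2.0 * kt**2 * (n - 1))


def logp_half_width(sigma_wat, sigma_oct, n, level=0.95, kt=RT):
    """Half-width of the LogP confidence interval at the requested level."""
    var = (gaussian_estimator_variance(sigma_wat, n, kt)
           + gaussian_estimator_variance(sigma_oct, n, kt))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # LogP = -(dG_oct - dG_wat) / (RT ln 10), so errors scale by 1 / (RT ln 10).
    return z * math.sqrt(var) / (kt * math.log(10.0))


# Hypothetical work standard deviations (kcal/mol), for illustration only.
for n in (100, 420):
    hw = logp_half_width(sigma_wat=1.0, sigma_oct=1.5, n=n)
    print(f"n = {n:3d}: 95% half-width = {hw:.2f} LogP units")
```

In this model the interval shrinks roughly as the inverse square root of the number of NE trajectories, while longer trajectories act through a smaller work variance (less dissipation), which enters both terms of the estimator variance.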