Introduction

The partition coefficient (P) is the ratio of the equilibrium concentration of a substance in an organic solvent to that in water when the substance is distributed between the two immiscible solvents. Because the organic layer is hydrophobic, the logarithm of P (LogP) can represent the lipophilicity of solute molecules. Because 1-octanol serves as a prototype of organic solvents, LogP with respect to the 1-octanol/water system is the most popular molecular descriptor of this kind. LogP values have thus been measured as an important physicochemical property pertinent to drug-likeness [1], toxicity [2], permeation of the blood-brain barrier to reach the central nervous system [3], and ADMET properties [4–7]. Furthermore, LogP values can also be related to the permeability of molecules across the cell membrane, which has a lipophilic central layer [8–10]. In addition to serving as a yardstick for molecular lipophilicity, LogP has also proved useful in estimating the dehydration cost for binding of a small molecule to a receptor protein [11]. LogP is thus one of the most prevalent molecular descriptors for quantifying the pharmacological properties of drug candidates.

Because of its importance in the discovery of new drugs and materials, a great deal of effort has been devoted to the development of reliable computational methods for LogP prediction. Many of these methods assume that molecular LogP values can be obtained by summing the contributions from the individual atoms or from dissected fragments [12, 13]. A number of quantitative structure-property relationship (QSPR) models with reasonable accuracy were also developed for LogP prediction using a variety of molecular descriptors [14–18]. LogP values of small organic molecules were also predicted successfully by comparing the solvation free energies with respect to water and 1-octanol calculated with the solvent-accessible surface area model [19], the extended solvent-contact approach [20], and the 3D density distribution function [21].

Although the usefulness of LogP is well appreciated in contemporary drug discovery, it has been observed to correlate only weakly with the membrane permeability as well as with the aqueous solubility of some small molecules [22–24]. These discrepancies stem in large part from the presence of a polar hydroxyl group in 1-octanol, which is a potential source of error in mimicking the hydrophobic environment [25, 26]. Hence, the logarithm of the distribution coefficient (LogD) with respect to the cyclohexane/water partitioning system may serve as a good alternative general-purpose molecular descriptor because cyclohexane is a purely lipophilic solvent with no polar moiety. LogD differs from LogP in that the former is measured under consideration of all the possible ionization and tautomerization states of a substance instead of taking into account only a single tautomeric form [27]. LogD proved to be a better descriptor than LogP, in particular for molecules capable of forming intermolecular or intramolecular hydrogen bonds [28].

Like LogP, LogD has been estimated with reasonable accuracy by various computational methods based on the parametrization of molecular surface area [29–31], molecular interaction fields [32, 33], and solubility-diffusion theory [34]. Motivated by the successful predictions of molecular solvation free energy and LogP [20, 35, 36], we address the usefulness of the solvent-contact model in estimating molecular LogD values through participation in the SAMPL5 blind prediction challenge for distribution coefficients. To improve the solvation free energy function required for computing the ratio of solute concentrations with respect to cyclohexane and water, we increased the number of atomic parameters to cope with the various chemical environments encountered in the 53 SAMPL5 molecules. This modification is expected to enhance the accuracy in predicting the solvation free energy and LogD because the electronic structures and bonding patterns peculiar to the drug-like molecules in the SAMPL5 dataset can be described appropriately by the extended atom types. The fundamental assumptions made to calculate molecular LogD with the extended solvent-contact model are presented and discussed. We also address the limitations to practical applicability inherent in the extended solvent-contact model, and suggest reasonable methods for further improvement.

Theory and computational methods

When the solute molecules diffuse passively across the two immiscible solvents, the ratio of equilibrium concentrations of the solute in the two solvents yields the partition coefficient (P). The P value of a solute molecule (S) with respect to cyclohexane/water partitioning system can be defined as follows.

$${\text{P}} = \frac{{[S]_{chx} }}{{[S]_{wat} }}$$
(1)

Because P takes the form of an equilibrium constant for the transfer of a solute molecule from water to cyclohexane, LogP can be related to the difference in solvation free energies (ΔΔG 0) between the two solvents, that is, ΔG sol in cyclohexane minus ΔG sol in water. Here, the solvation free energy (ΔG sol ) refers to the free energy change for the transfer of a solute molecule from the gas phase to the solvent. LogP of a molecule can thus be related to ΔΔG 0 as follows when the latter is given in kcal/mol; the factor 1.364 corresponds to RT ln 10 at 298.15 K.

$${\text{LogP}} = - \frac{{\Delta \Delta G^{0} }}{1.364}$$
(2)

Whereas P is measured from the concentrations of a single neutral form, D is defined with respect to all the forms of a solute molecule available in the two solvents. D should therefore be expressed with summations running over all the possible ionization and tautomerization states of a solute molecule.

$${\text{D}} = \frac{{\sum\nolimits_{i} {\gamma_{i} [T_{i} ]_{chx} } }}{{\sum\nolimits_{k} {\gamma_{k} [T_{k} ]_{wat} } }}$$
(3)

Here, γ i and T i represent the activity coefficient and the concentration of a single ionization or tautomerization state of the solute molecule in each solvent, respectively.

When a solute molecule with a low ionization constant dissolves in the two solvents to form dilute solutions, D can be approximated by P because the D value is dominated by a single tautomeric form of the solute. We note in this regard that SAMPL5 molecules involve only weakly or hardly ionizable moieties such as carboxylic acid, amine, and phenol with ionization constants smaller than 10⁻⁴. The LogD value of each SAMPL5 molecule can thus be estimated from the difference between ΔG sol in water and that in cyclohexane.

$${\text{LogD}} = \frac{{\Delta G_{sol}^{wat} - \Delta G_{sol}^{chx} }}{1.364}$$
(4)

Here, ΔG wat sol and ΔG chx sol refer to the solvation free energies of the solute at 298.15 K in water and in cyclohexane, respectively.
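The constant 1.364 in Eqs. (2) and (4) is RT ln 10 expressed in kcal/mol at 298.15 K. The minimal Python sketch below reproduces that factor and applies Eq. (4). The function name `logd_from_solvation` and the example energies are illustrative assumptions, not values from this study.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)
T = 298.15            # temperature in K, as stated for Eq. (4)

# RT ln(10) converts a free energy difference (kcal/mol) into log10 units
RT_LN10 = R_KCAL * T * math.log(10.0)


def logd_from_solvation(dg_wat, dg_chx, factor=RT_LN10):
    """Eq. (4): LogD from solvation free energies (kcal/mol) in water and cyclohexane."""
    return (dg_wat - dg_chx) / factor


# Hypothetical solute that is 2.0 kcal/mol more stable in cyclohexane than in water
example_logd = logd_from_solvation(dg_wat=-3.0, dg_chx=-5.0)
print(round(RT_LN10, 3), round(example_logd, 2))
```

A solute that is better solvated in cyclohexane (more negative ΔG sol) gives a positive LogD, as expected for a lipophilic molecule.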

To calculate the LogD values using Eq. (4), we constructed the molecular solvation free energy functions based on the extended solvent-contact model as detailed in the previous papers [35, 36].

$$\Delta G_{sol} = \sum\limits_{i}^{atoms} {S_{i} \left( {O_{i}^{\hbox{max} } - \sum\limits_{j \ne i}^{atoms} {V_{j} e^{{ - \frac{{r_{ij}^{2} }}{{2\sigma^{2} }}}} } } \right)}$$
(5)

Here, a Gaussian envelope function of the interatomic distances (r ij ) between solute atoms is introduced to define the occupied volume from which solvent molecules are excluded. The S i , O max i , and V i parameters represent the atomic solvation energy per unit volume, the maximum atomic occupancy, and the atomic fragmental volume, respectively. The O max i and V i values are related to the volume of a solute atom in the isolated state and that in molecules, respectively. Negative and positive signs of the S i parameter indicate stabilization and destabilization of the solute atom, respectively, as a consequence of its interactions with solvent molecules. These three atomic parameters assigned to each atom type must be determined for the solvation free energy function to be used in LogD calculations. To optimize all the atomic parameters with respect to water and cyclohexane, a standard genetic algorithm was employed with a training set comprising the molecules for which experimental ΔG sol data were available for both solvents. As widely adopted in the literature, the σ value in Eq. (5) was set equal to 3.5 Å during the parameterizations.
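As a rough illustration of how Eq. (5) can be evaluated from atomic coordinates, the following Python sketch sums the per-atom terms directly. The function name, argument layout, and all numerical parameter values are hypothetical and chosen only to show the arithmetic; they are not the optimized parameters of this work.

```python
import math

SIGMA = 3.5  # Gaussian width in angstroms, as set for Eq. (5)


def solvation_free_energy(coords, S, O_max, V, sigma=SIGMA):
    """Eq. (5): sum_i S_i * (O_i^max - sum_{j != i} V_j * exp(-r_ij^2 / (2 sigma^2)))."""
    total = 0.0
    for i, ri in enumerate(coords):
        occupied = 0.0  # volume around atom i occupied by the other solute atoms
        for j, rj in enumerate(coords):
            if j == i:
                continue
            r2 = sum((a - b) ** 2 for a, b in zip(ri, rj))
            occupied += V[j] * math.exp(-r2 / (2.0 * sigma ** 2))
        total += S[i] * (O_max[i] - occupied)
    return total


# Hypothetical two-atom solute (coordinates in angstroms, made-up parameters)
coords = [(0.0, 0.0, 0.0), (3.5, 0.0, 0.0)]
dg = solvation_free_energy(coords, S=[-0.01, 0.02], O_max=[300.0, 200.0], V=[10.0, 5.0])
print(dg)
```

The double loop scales quadratically with the number of atoms, which is inexpensive for drug-sized molecules; evaluating it with two parameter sets (water and cyclohexane) gives the two ΔG sol values needed for Eq. (4).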

With respect to the two-solvent partitioning system, the cyclohexane phase contains only 0.04 % of water at 298.15 K [37], in contrast to 4 % in the 1-octanol/water system. Such an exceptionally high purity is not surprising because cyclohexane is a hydrophobic molecule with no polar group. Therefore, the experimental data for the training set molecules were adopted without a composition-weighted correction as the reference ΔG sol values to optimize the atomic parameters in the solvation free energy function.

Preparation of training set

As a preliminary step to optimizing the atomic parameters in the solvation free energy function, we had to prepare a training set containing a sufficient number of molecules whose experimental ΔG sol values were available for both water and cyclohexane. In contrast to the abundance of LogP values for a variety of organic molecules, the scarcity of experimental LogD data for the cyclohexane/water partitioning system made it difficult to establish a proper training set. Because the experimental ΔG sol data for cyclohexane were also insufficient to cope with all SAMPL5 molecules, the training set was constructed by combining the molecules with ΔG sol values for cyclohexane and water with those for which ΔG sol data were available for 1-octanol and water. This was inevitable because experimental ΔG sol data are scarce even for other hydrocarbon solvents such as hexane. Such a combination was the only way to optimize the atomic parameters of the atom types missing from the molecules for which experimental ΔG sol values in cyclohexane were available. 1-Octanol is likely to serve as an effective surrogate for cyclohexane because it is also categorized as a hydrophobic solvent. Indeed, 1-octanol has often been used as a simplified model system for lipids [38] because of its high lipophilicity and low dielectric constant of 10.3, which result from the presence of a long hydrocarbon chain.

A total of 92 molecules were collected to construct the training set, the structures of which are illustrated in Fig. 1. The ΔG sol values in cyclohexane for 77 molecules served as the reference data for parameterization, while those for the remaining 15 molecules were approximated with the corresponding ΔG sol values in 1-octanol. This minor subset included the molecules containing the atom types for sp 2 carbon, amide nitrogen, and sulfur. The ΔG sol values of most molecules in the training set with respect to cyclohexane, water, and 1-octanol were extracted from the Minnesota Solvation Database, version 2012 [39], while those of the remaining molecules including sp 2 carbon, planar and amide nitrogens with 3 substituents, and sp 2 sulfur with two substituents were retrieved from the literature [40, 41].

Fig. 1

Chemical structures of the selected molecules in the training set for the optimization of atomic parameters in the solvation free energy function. Asterisks indicate the molecules for which the experimental ΔG sol values in 1-octanol were referenced instead of those in cyclohexane

Definition of atom types

The contributions of individual atoms to the solvation free energy vary even among atoms of the same element because of the diverse chemical environments that the atoms in a molecule encounter. The atom types should therefore be assigned under consideration of detailed atomic properties including the hybridization state, electronegativity, and the number of substituents. For example, specific atom types should be defined for functional groups with a characteristic electronic structure, such as carbonyl carbon, phenolic oxygen, and hydrogen atoms attached to different heteroatoms. A total of 41 atom types were assigned in this study to discriminate the differences in the interactions with solvent among the atoms contained in the 53 SAMPL5 molecules. The number of atom types was reduced to 33 when the fifteen molecules without ΔG sol values for cyclohexane were excluded, which illustrates why those molecules were needed in the optimization of the solvation free energy function. All atom types were designated in Sybyl MOL2 format for simplicity in discriminating similar ones.

Optimization of atomic parameters

The molecular structures in the training set and in the SAMPL5 dataset were fully optimized with ab initio quantum chemical calculations at the B3LYP/6-31G** level of theory to prepare the atomic coordinates required to compute the solvation free energies. All the atomic parameters defined for the 41 atom types were then determined with respect to cyclohexane and water to estimate the LogD values of SAMPL5 molecules based on the extended solvent-contact model. Because the atomic fragmental volume (V i ) parameters converged poorly during simultaneous optimization with the O max i and S i values, they were optimized separately using a standard genetic algorithm as described in the previous papers [35, 36]. Because of this convergence problem, V i values were allowed to vary among all the atoms even within the same atom type. This was necessary because the volumes of individual atoms depend on the overall structure of a solute molecule.

The optimization of the V i parameters started with calculating the volumes (V mol ) of all the solute molecules. Each molecule was placed in a 3-D box whose length, width, and height corresponded to the maximum extents of the molecular van der Waals volume along the three coordinate axes. To construct the van der Waals volume, the atomic radii of carbon, nitrogen, oxygen, sulfur, hydrogen, fluorine, chlorine, and bromine atoms were set to 1.53, 1.45, 1.36, 1.70, 1.08, 1.30, 1.65, and 1.80 Å, respectively. Monte Carlo simulations were then carried out to calculate the V mol value by randomly selecting grid points in the 3-D box embedding the solute molecule. More specifically, the V mol value was obtained as the product of the box volume (V box ) and the ratio of the number of trials in which the selected point fell within the molecular van der Waals volume (N hits ) to the total number of trials (N trials ). All the V mol values of the solute molecules were thus calculated with the following equation.

$$V_{mol} = V_{box} \times \frac{{N_{hits} }}{{N_{trials} }}$$
(6)
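The hit-counting estimate of Eq. (6) can be sketched in a few lines of Python. This is only an illustrative sketch: it samples uniformly distributed continuous points rather than points on a predefined grid, and the function name, trial count, and seed are assumptions rather than settings reported here.

```python
import random


def monte_carlo_volume(coords, radii, n_trials=100_000, seed=0):
    """Eq. (6): V_mol = V_box * N_hits / N_trials for a union of atomic spheres."""
    rng = random.Random(seed)
    # Box spanning the maximum extent of the van der Waals volume along each axis
    lo = [min(c[k] - r for c, r in zip(coords, radii)) for k in range(3)]
    hi = [max(c[k] + r for c, r in zip(coords, radii)) for k in range(3)]
    v_box = 1.0
    for k in range(3):
        v_box *= hi[k] - lo[k]
    n_hits = 0
    for _ in range(n_trials):
        p = [rng.uniform(lo[k], hi[k]) for k in range(3)]
        # A trial counts as a hit if the point lies inside any atomic sphere
        if any(sum((p[k] - c[k]) ** 2 for k in range(3)) <= r * r
               for c, r in zip(coords, radii)):
            n_hits += 1
    return v_box * n_hits / n_trials


# Single carbon-sized sphere (radius 1.53 angstroms) as a sanity check
v_sphere = monte_carlo_volume([(0.0, 0.0, 0.0)], [1.53], n_trials=50_000)
print(v_sphere)
```

For a single sphere the estimate can be compared with the analytic volume 4πr³/3, which gives a quick check on the number of trials needed for a given precision.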

Using the calculated V mol values, the V i parameters were optimized with the standard genetic algorithm. This began with the definition of a generation of 100 vectors comprising the V i parameters for all the atoms in the molecules, followed by the removal of 50 vectors with a bias toward preserving the fittest ones with the lowest error. The 50 empty vectors were then filled by point mutations, which alter the value of one of the parameters with probability 0.01, and by cross breeds, which with probability 0.6 select some parameters from one vector to replace the elements of another vector among the top 50. The 50 new vectors created in this way were then evaluated together with the top 50. This cycle was repeated as many times as desired. To evaluate the 100 vectors, we used the error hypersurface (F V ) defined as the sum over solute molecules of the absolute difference between the V mol value and the sum of the V i values of each molecule.

$$F_{V} = \sum\limits_{k}^{molecules} {\left| {V_{mol}^{k} - \sum\limits_{i}^{atoms} {V_{i} } } \right|}$$
(7)
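For concreteness, Eq. (7) amounts to the short function below. The data layout (one list of atomic volumes per molecule) and the numbers in the usage lines are hypothetical.

```python
def volume_error(v_mol, atom_volumes):
    """Eq. (7): F_V = sum_k | V_mol^k - sum_i V_i |, with one list of V_i per molecule k."""
    return sum(abs(vm - sum(vi)) for vm, vi in zip(v_mol, atom_volumes))


# Two hypothetical molecules: molecular volumes and trial atomic volumes
f_v = volume_error([10.0, 20.0], [[4.0, 5.0], [12.0, 9.0]])
print(f_v)  # |10 - 9| + |20 - 21| = 2.0
```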

After the parameterization of the V i values, the O max i and S i values for all the atom types were optimized concurrently using the genetic algorithm to make the solvation free energy function suitable for calculating the ΔG sol values for cyclohexane and water. This second parameterization began with the construction of a generation consisting of 100 vectors whose elements were the O max i and S i parameters for all the available atom types. In the second step, 50 of the 100 vectors were emptied with a bias toward retaining the best-fit vectors with the lowest error. These 50 vacant vectors were filled again with new elements generated from those of the remaining 50 vectors. The new vector elements were obtained by point mutations with probability 0.01 to alter the O max i and S i values as well as by cross breeds with probability 0.6 to exchange the corresponding elements in the top 50 vectors. The newly generated vectors were then combined with the top 50 to be evaluated together. This procedure was repeated until the convergence criterion was met. Each vector was evaluated using the error hypersurface (F s ) given by summing the absolute differences between the ΔG sol values of the training set molecules measured experimentally (ΔG i exp ) and those calculated with the solvation free energy function (ΔG i calc ). This fitness function can be written in the following form.

$$F_{s} = \sum\limits_{i}^{molecules} { \, \left| {\Delta G_{\exp }^{i} - \Delta G_{calc}^{i} } \right|}$$
(8)

The optimizations tended to converge after approximately 100,000 iterations for the V i values and 1,000 for the O max i and S i values.
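The genetic algorithm loop described above can be sketched as follows. This is one possible reading of the procedure, not the implementation used in this work: survivors are chosen by strict truncation rather than biased selection, crossover copies a random subset of elements from a second survivor, and mutation resamples an element uniformly within assumed bounds. The toy objective has the absolute-error form of Eq. (8) but uses made-up reference data and a simple linear model.

```python
import random


def genetic_minimize(error_fn, n_params, bounds, n_generations=200,
                     pop_size=100, p_mut=0.01, p_cross=0.6, seed=0):
    """Minimize error_fn over parameter vectors with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(n_params)] for _ in range(pop_size)]
    half = pop_size // 2
    for _ in range(n_generations):
        # Keep the half of the generation with the lowest error
        pop.sort(key=error_fn)
        survivors = pop[:half]
        children = []
        while len(children) < pop_size - half:
            child = list(rng.choice(survivors))
            if rng.random() < p_cross:
                # Cross breed: take a random subset of elements from another survivor
                mate = rng.choice(survivors)
                for k in range(n_params):
                    if rng.random() < 0.5:
                        child[k] = mate[k]
            for k in range(n_params):
                # Point mutation: resample an element with probability p_mut
                if rng.random() < p_mut:
                    child[k] = rng.uniform(lo, hi)
            children.append(child)
        pop = survivors + children
    return min(pop, key=error_fn)


# Hypothetical reference data: (count of type a, count of type b, reference value)
REFERENCE = [(2, 1, -3.0), (1, 3, 1.0), (4, 0, -8.0)]


def toy_error(p):
    """Eq. (8)-style sum of absolute deviations for a toy two-parameter linear model."""
    return sum(abs(ref - (na * p[0] + nb * p[1])) for na, nb, ref in REFERENCE)


best = genetic_minimize(toy_error, n_params=2, bounds=(-5.0, 5.0), n_generations=50)
print(best, toy_error(best))
```

Because the best half of each generation is always retained, the lowest error in the population never increases from one generation to the next. Either Eq. (7) or Eq. (8) can be supplied as `error_fn`.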

Results and discussion

Chemical structures of selected SAMPL5 molecules are shown in Fig. 2. We note that the SAMPL5 dataset covers a wide spectrum of shapes and sizes, with molecular weights (MWs) ranging from 170 to 810 amu, whereas SAMPL4 included only small molecules with MW lower than 280 amu. This indicates that more rigorous computational methods would be required in the SAMPL5 prediction challenge than those adopted for SAMPL4 molecules to achieve similar performance. Nonetheless, the addition of new atom types should be kept minimal in the extended solvent-contact model lest the optimization lead to overtraining due to excessive atomic parameters.

Fig. 2

Chemical structures of the selected molecules included in SAMPL5 dataset. The structures of all SAMPL5 molecules are presented in Supplementary Materials

It should also be noted that several SAMPL5 molecules can exist in different tautomeric forms. Although the accuracy in LogD prediction would be enhanced by considering this structural multiplicity, we used only the major tautomeric form of each SAMPL5 molecule for computational simplicity. For example, the enol form was adopted when the ring system involving the –OH moiety satisfied the aromaticity conditions, as in SAMPL5_050, SAMPL5_056, and SAMPL5_083, whereas the keto form was selected in the other cases.

LogD values of SAMPL5 molecules are expected to be similar to their LogP values because the molecules include only weakly or hardly ionizable groups such as carboxylic acid, amine, and phenol moieties, which are weak acids or bases with ionization constants smaller than 10⁻⁴. All SAMPL5 molecules were therefore assumed to be neutral in this study to make it straightforward to determine their solvation free energies. Furthermore, the experimental LogD values of all SAMPL5 molecules were measured at concentrations lower than 0.1 mM. Solute dimers are unlikely to form in such dilute solutions, which would further reduce the difference between the LogD and LogP values of SAMPL5 molecules. Taken together, LogD values of SAMPL5 molecules may be estimated with the solvation free energy functions optimized for water and cyclohexane.

To calculate the ΔG sol values of each SAMPL5 molecule in water and cyclohexane, all atomic parameters in the solvation free energy function were optimized with respect to both solvents using the experimental data for the 92 training set molecules. Table 1 lists the optimized O max i and S i values for the 41 atom types introduced to describe all the atoms in SAMPL5 molecules under a variety of chemical circumstances. The V i parameters for all the atoms in SAMPL5 molecules are presented in Supplementary Materials. They are presented separately from the O max i and S i parameters because they can vary among atoms of the same atom type. Despite the complexity of the parametrization, the O max i and S i values tend to vary in a manner consistent with general atomic properties. For example, the O max i parameters of the second-period atoms (C, N, O, and F) range from 270 to 400 irrespective of the solvent, whereas those of hydrogen atoms are smaller than 250. In comparison, most O max i values of sulfur and bromine atoms exceed 400. This trend can be understood in terms of the conceptual similarity of the O max i parameter to the atomic volume.

Table 1 Optimized O max i and S i parameters for 41 atom types in SAMPL5 molecules

The S i parameters appear to vary significantly with the atom types even among the same elements whereas the O max i values for the atoms in the same period are relatively similar. For instance, the S i value of carbonyl carbon (C.CO_2) with respect to water is even more negative than those of alkyl and aromatic carbons. This may be related with the accumulation of positive charge on the carbonyl carbon, which stems from the electron withdrawal by the adjacent carbonyl oxygen. As can be expected from the large differences in physicochemical properties between the two solvents, the S i parameters for water are quite different from those for cyclohexane. We note in this regard that all the S i values corresponding to carbon atoms converge to the negative values in cyclohexane as a consequence of reflecting the attractive van der Waals interactions between solute carbon atoms and the solvent. On the other hand, the S i values of carbon atoms become positive or less negative in water due to the weakening of hydrophobic interactions with solvent.

Consistent with the major contributions of nitrogen and oxygen atoms to molecular solubility in polar solvents, their S i values for most atom types are optimized to be highly negative in water. Actually, the interactions of polar solute atoms with water molecules are expected to be attractive because they can be stabilized in aqueous solution not only by the long-range electrostatic interactions with bulk solvent but also by the local hydrogen bonds with solvent molecules. However, most S i values of nitrogen and oxygen atoms become much less negative or positive with the change of solvent from water to cyclohexane. This can be understood in the context of the weakening of solute-solvent interactions due to the lack of polarity in solvent molecules. In case of hydrogen atoms, the S i parameters tend to be more negative as the adjacent atom changes from carbon to polar atoms, which can also be attributed to the facilitation of the electrostatic interactions with solvent.

Using the V i , O max i , and S i parameters of all 41 atom types optimized with the 92 training set molecules, the solvation free energies of SAMPL5 molecules were calculated for water and cyclohexane to produce the final LogD values. The correlation diagrams of experimental versus calculated ΔG sol values of the 92 molecules in the training set and of experimental versus calculated LogD values of the 53 SAMPL5 molecules are shown in Figs. 3 and 4, respectively. All ΔG sol values for the molecules in the training set and the SAMPL5 dataset are provided in Supplementary Materials. Although a good correlation is observed between the experimental and computational solvation free energies of the training set molecules, with a correlation coefficient (R) larger than 0.96, the prediction accuracy decreases significantly in the estimation of LogD data for the SAMPL5 molecules, with an associated R value of 0.55. The average error (AE) and root mean square error (RMSE) amount to 1.53 and 3.03, respectively, which rank 33rd and 17th among 62 participants in the SAMPL5 prediction challenge for the unitless LogD values.
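The three accuracy measures quoted above can be computed as in the following Python sketch. It assumes that AE denotes the mean absolute deviation between experimental and calculated values, which is an interpretation rather than a definition given in the text, and the LogD values in the usage lines are invented for illustration and are not SAMPL5 data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_error(x, y):
    """Mean absolute deviation between experiment and calculation (assumed meaning of AE)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rmse(x, y):
    """Root mean square error between experiment and calculation."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical LogD values for illustration only
logd_exp = [-1.0, 0.5, 2.0, -3.0]
logd_calc = [-0.5, 1.5, 1.0, -2.0]
print(pearson_r(logd_exp, logd_calc),
      average_error(logd_exp, logd_calc),
      rmse(logd_exp, logd_calc))
```

Because RMSE weights large deviations more heavily than AE, a few outliers can move a submission's RMSE rank and AE rank in different directions.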

Fig. 3

Correlation diagrams for the experimental versus calculated solvation free energies of 92 molecules in the training set with respect to A cyclohexane and B water. Indicated in red circles are the training set molecules for which the experimental ΔG sol values in 1-octanol were referenced

Fig. 4

Correlation diagram between the experimental and calculated LogD values of 53 SAMPL5 molecules whose atomic parameters are optimized with 92 training set molecules. The upper and the lower red circles indicate SAMPL5_080 and SAMPL5_074, respectively, which reveal a large deviation between the experimental and calculated LogD values

With respect to the modest accuracy in LogD prediction, we note that the training set could not be composed entirely of molecules for which experimental ΔG sol data were available for both water and cyclohexane because such molecules lacked some atom types present in SAMPL5 molecules. For example, the atom types for sp 2 carbon (C.2_1, C.2_2, and C.2_3) were missing from the molecules with experimental ΔG sol values for cyclohexane. Therefore, the training set molecules possessing sp 2 carbon had to be drawn from the molecules with experimental ΔG sol values for 1-octanol. The same was true of the training set molecules containing the atom types N.pl_3, N.am_2, N.am_3, S.12, and S.3_2. These substitute selections of training set molecules can lead to incomplete optimization of the solvation free energy function for cyclohexane, which would ultimately impair the accuracy of LogD predictions. Indeed, the largest differences between the experimental and calculated LogD values are observed for SAMPL5_074 and SAMPL5_080 as indicated in Fig. 4, both of which contain at least two atom types missing from the molecules with experimental ΔG sol data for cyclohexane.

We now turn to the second prediction challenge for LogD values with the subset of SAMPL5 molecules for which all the atomic parameters can be fully optimized with the reference ΔG sol data for cyclohexane. By comparing the new results with those for all SAMPL5 molecules, it would be possible to address the influence of replacing the ΔG sol values for cyclohexane with those for 1-octanol on the accuracy in LogD prediction. This comparative analysis started with the reoptimization of solvation free energy functions using only the training set molecules for which experimental ΔG sol data were available for cyclohexane. Accordingly, we excluded some SAMPL5 molecules containing the atom types missing in the new training set. As a consequence, 77 and 31 molecules remained in the training set and SAMPL5 test set, respectively, along with the decrease in the number of atom types from 41 to 33.

Table 2 lists the newly optimized atomic parameters using the modified training set with the same procedure as described in the previous section. The S i parameters for water and the O max i values for both solvents remain strongly correlated with those in the parameterizations with the full training set (Table 1). The R values associated with the comparisons of the new and previous S i (water), O max i (cyclohexane), and O max i (water) parameters amount to 0.86, 0.79, and 0.88, respectively. Some general tendencies are therefore also found in the newly optimized O max i and S i parameters of varying atom types. For example, the S i values of most carbon atoms are negative and positive in cyclohexane and in water, respectively. The electronegative nitrogen and oxygen atoms have highly negative S i values with respect to water, which is consistent with their major contributions to the stabilization of a parent molecule in aqueous solution. However, the newly optimized S i values with respect to cyclohexane appear to become quite different from those obtained with the full dataset with the associated R value of 0.13. This indicates that the calculated ΔG sol values of SAMPL5 molecules for cyclohexane can vary significantly due to the removal of the training set molecules lacking the reference ΔG sol data for cyclohexane, which would in turn have the effect of changing the results of LogD prediction in a large part.

Table 2 Optimized O max i and S i parameters for 33 atom types in SAMPL5 molecules for which the experimental ΔG sol values are available for cyclohexane

The correlation diagrams of experimental versus calculated ΔG sol values of the 77 training set molecules with 33 atom types are shown in Fig. 5. Both R values for cyclohexane and water remain similar to those obtained with the original training set comprising 92 molecules and 41 atom types (Fig. 3). As shown in Fig. 6, however, the accuracy in LogD prediction improves remarkably, with the R value increasing from 0.55 to 0.82. We note that this R value is close to the top-ranked one (0.84) in the SAMPL5 blind prediction challenge for LogD. This accuracy enhancement is apparently attributable to the exclusion of the training set molecules without reference ΔG sol values for cyclohexane. In particular, the S i parameters for cyclohexane seem to be optimized better than before by limiting the training set to the molecules for which the experimental ΔG sol data are available, which leads to better prediction of ΔG sol values for cyclohexane and culminates in the accuracy enhancement in LogD predictions. This result exemplifies the importance of constructing a proper training set for the extended solvent-contact model to be useful for predicting the physicochemical properties of drug-like molecules.

Fig. 5

Correlation diagrams for the experimental versus calculated solvation free energies with respect to A cyclohexane and B water for 77 training set molecules for which experimental ΔG sol values are available for both solvents

Fig. 6

Correlation diagrams between the experimental and calculated LogD values of 31 SAMPL5 molecules for which all atomic parameters can be optimized using 77 training set molecules with experimental ΔG sol data for cyclohexane. The upper and the lower red circles indicate SAMPL5_065 and SAMPL5_081, respectively, which reveal a large deviation between the experimental and calculated LogD values

Listed in Table 3 are the LogD values of SAMPL5 molecules calculated with and without the ΔG sol data for 1-octanol in the training set in comparison with the corresponding experimental results. Consistent with the increase in R value, both AE and RMSE decrease from 1.53 and 3.03 to 0.89 and 1.60, respectively, due to the modification of the training set. Notably, the RMSE value becomes lower than that of the best-scoring submission in the SAMPL5 blind prediction challenge for LogD. Although our new results for the 31-molecule subset cannot be compared directly with those obtained for the full SAMPL5 dataset, it can at least be argued that the extended solvent-contact model would be one of the most efficient methods for LogD prediction given sufficient experimental ΔG sol data for cyclohexane.

Table 3 Comparison of experimental and calculated LogD values of SAMPL5 molecules

With respect to the improvement of the accuracy in LogD prediction, the S i parameters of the planar nitrogens bonded to aromatic rings appear to change most significantly in the optimizations with the new training set. For example, the S i values of N.pl_1 and N.pl_2 for cyclohexane decrease from 3.143 and −1.667 (Table 1) to −12.302 and −5.079 (Table 2), respectively, due to the exclusion of the molecules lacking the reference ΔG sol data for cyclohexane in the training set. The highly negative S i values of planar nitrogens can be understood in the context that their hydrophobic interactions with cyclohexane molecules would be facilitated along with the delocalization of lone-pair electrons into the adjacent aromatic ring, which has the effect of decreasing the polarity on the nitrogens. In this regard, the S i values seem to be more negative in cyclohexane than in 1-octanol because the former is more hydrophobic than the latter. Because N.pl_1 and N.pl_2 are the most abundant heteroatoms in the SAMPL5 dataset, a significant enhancement in LogD prediction is anticipated by the better optimization of their S i values with respect to cyclohexane. Indeed, the deviations between the experimental and calculated LogD values of SAMPL5_015, SAMPL5_027, and SAMPL5_065 including the planar nitrogens appear to decrease significantly from 1.20, 1.87, and 5.78 to 0.09, 0.81 and 2.74 (Table 3), respectively, along with the modification of the training set.

Despite the considerable accuracy enhancement in LogD prediction achieved by modifying the training set, a large discrepancy between experimental and computational results is still observable for some molecules such as SAMPL5_065 and SAMPL5_081, as indicated in Fig. 6. We note in this regard that several atom types (C.3_3, N.3_3, N.pl_2, N.am_1, O.pl_2, and O.es_1) appear only once or twice in the training set due to the rarity of experimental ΔG sol data for cyclohexane. It therefore seems difficult to optimize the corresponding atomic parameters fully during the parameterization so that they reflect the various chemical environments around these atoms in molecules. The low occurrences of the six atom types in the training set are likely to serve as one of the major error sources in LogD prediction because they are present in a number of SAMPL5 molecules.
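Sparsely sampled atom types of this kind can be flagged before parameterization by counting how many training-set molecules contain each type. The following is a minimal sketch under stated assumptions: the function `rare_atom_types`, the molecule names, and the atom-type assignments are hypothetical (including the label C.ar_1), and occurrences are counted per molecule rather than per atom.

```python
from collections import Counter


def rare_atom_types(training_set, max_count=2):
    """Return atom types found in at most `max_count` training molecules.

    `training_set` maps a molecule name to the list of atom types it
    contains. Each type is counted once per molecule (an assumption).
    """
    counts = Counter()
    for atom_types in training_set.values():
        counts.update(set(atom_types))
    return {atom_type: n for atom_type, n in counts.items() if n <= max_count}


# Hypothetical training set for illustration only.
training_set = {
    "mol_a": ["C.ar_1", "C.ar_1", "N.pl_2"],
    "mol_b": ["C.ar_1", "C.3_3"],
    "mol_c": ["C.ar_1", "O.es_1", "C.3_3"],
    "mol_d": ["C.ar_1"],
}
rare = rare_atom_types(training_set)
for atom_type in sorted(rare):
    print(f"{atom_type}: present in {rare[atom_type]} training molecule(s)")
```

Atom types returned by such a check are those whose parameters rest on very few reference data and would therefore merit the most caution when applied to test molecules.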

A drawback of the extended solvent-contact model is thus that it requires a sufficient amount of experimental data for the optimization of the solvation free energy function. However, this requirement does not seem to be severe because the LogD values of 31 SAMPL5 molecules were estimated with reasonable accuracy using only 77 training set molecules. The characteristic feature that distinguishes the extended solvent-contact model from other computational methods is that one can calculate the molecular LogD values straightforwardly from the solvation free energy functions and the atomic coordinates of solute molecules. This is in contrast with quantitative structure-property relationship (QSPR), quantum mechanical, and statistical simulation methods, which require a high computational cost for calculating the molecular descriptors, the electronic structures, and the trajectories in configurational space, respectively. Because of the simplicity in model building and the small computational burden for parameterization, the extended solvent-contact model is expected to serve as one of the most efficient computational methods for LogD prediction upon the enrichment of experimental ΔG sol data for organic solvents.
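Once the two solvation free energies are available, the final conversion to a distribution coefficient is a single arithmetic step. The sketch below assumes the standard thermodynamic relation for the transfer of a neutral solute between water and cyclohexane, LogD ≈ (ΔG sol,water − ΔG sol,cyclohexane) / (RT ln 10), with energies in kcal/mol and T = 298.15 K; ionization in the aqueous phase is neglected. The function name and the input values are hypothetical, and the expression is not claimed to reproduce the exact equation used in this work.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def logd_from_solvation(dg_water, dg_cyclohexane, temperature=298.15):
    """Estimate LogD from solvation free energies given in kcal/mol.

    Assumes a neutral solute and the relation
    LogD = (dG_water - dG_cyclohexane) / (R * T * ln 10).
    """
    rt_ln10 = GAS_CONSTANT_KCAL * temperature * math.log(10)
    return (dg_water - dg_cyclohexane) / rt_ln10


# Hypothetical solvation free energies (kcal/mol) for illustration only.
logd = logd_from_solvation(dg_water=-5.0, dg_cyclohexane=-7.0)
print(f"LogD = {logd:.2f}")
```

At 298.15 K, RT ln 10 is about 1.36 kcal/mol, so under this assumption an error of roughly 1.4 kcal/mol in the difference of the two ΔG sol values corresponds to one LogD unit.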

With respect to the accuracy enhancement in LogD prediction, it is also noteworthy that the solvation free energy function in Eq. (5) lacks the entropic term. Although the determination of molecular solvation entropy was long considered very difficult, it was recently shown that it can be estimated accurately by combining the free energy perturbation method and the scaled particle theory to calculate the electrostatic and hydrophobic contributions of solvent-solute interactions, respectively [42]. Because both enthalpic and entropic contributions to the solvation free energy are experimentally measurable, the potential parameters in the two terms can be optimized independently using the corresponding reference data. This dual parameterization should yield better predictions of solvation free energies than the single parameterization because more diverse experimental data can be referenced. Our future studies will focus on improving the accuracy of LogD prediction by adding a solvation entropy term to the solvation free energy function.

Conclusions

We addressed the applicability of the extended solvent-contact model to the calculation of molecular LogD values through participation in the SAMPL5 blind prediction challenge. After defining the atomic parameters for 41 atom types to describe a total of 53 SAMPL5 molecules, the solvation free energy function was optimized with respect to water and cyclohexane using 92 training set molecules to obtain the ΔG sol values required to calculate LogD. Due to the deficiency of experimental data for cyclohexane, the reference ΔG sol values of 15 training set molecules were replaced with those for 1-octanol. The LogD values of SAMPL5 molecules were predicted with modest accuracy, with R, AE, and RMSE values of 0.55, 1.53, and 3.03, respectively, for the comparison of experimental and computational results. The incomplete optimization of the atomic S i parameters with respect to cyclohexane proved to be the major source of error in LogD prediction. The R, AE, and RMSE values could be improved remarkably to 0.82, 0.89, and 1.60, respectively, when the predictions were made for 31 SAMPL5 molecules containing the atom types for which the experimental reference ΔG sol data were available for cyclohexane. This considerable enhancement in performance stemmed from the better parameterization of S i values achieved by restricting the training set to the molecules with experimental ΔG sol data for cyclohexane. The most significant improvements in LogD prediction were observed for the SAMPL5 molecules containing planar nitrogens, whose attractive van der Waals interactions with cyclohexane could be described appropriately only with the S i values optimized using the modified training set. Judging from the simplicity in model building and from the low computational cost for parameterization, the extended solvent-contact model is anticipated to serve as a valuable computational tool for LogD prediction upon the enrichment of experimental ΔG sol data for cyclohexane.