SAMPL6 logP challenge: machine learning and quantum mechanical approaches

Patel, Prajay; Kuntz, David M.; Jones, Michael R.; Brooks, Bernard R.; Wilson, Angela K.

doi:10.1007/s10822-020-00287-0

Published: 30 January 2020

Volume 34, pages 495–510, (2020)

Journal of Computer-Aided Molecular Design

Prajay Patel¹,
David M. Kuntz²,
Michael R. Jones³,
Bernard R. Brooks³ &
…
Angela K. Wilson¹,² (ORCID: orcid.org/0000-0001-9500-1628)

Abstract

Two types of approaches were used to predict the logP values of 11 molecules as part of the SAMPL6 logP blind prediction challenge: (a) approaches that combine quantitative structure–activity relationships, quantum mechanical electronic structure methods, and machine learning, and (b) electronic structure vertical solvation approaches. Using structures optimized with density functional theory (DFT), several molecular descriptors were calculated for each molecule, including van der Waals areas and volumes, HOMO/LUMO energies, dipole moments, polarizabilities, and electrophilic and nucleophilic superdelocalizabilities. A multilinear regression model and a partial least squares model were trained on a set of 97 molecules. In addition, descriptors were generated with the Molecular Operating Environment (MOE) and used to build further machine learning models. The electronic structure vertical solvation approaches considered include DFT and domain-based local pair natural orbital methods combined with the solvated variant of the correlation consistent composite approach.

Introduction

A partition coefficient (P) is the ratio of the concentrations of an un-ionized solute in two immiscible phases. If one of the phases is water, the partition coefficient can be considered a measure of the hydrophobicity or hydrophilicity of a substance. The coefficient, usually expressed as the base-10 logarithm of this ratio (logP), is a useful estimator of how pharmaceutical drugs are absorbed.

While partition coefficients have been determined experimentally for decades, theoretical approaches for calculating physicochemical properties such as solubility and ionization are increasingly used in conjunction with experiment to expedite the determination of properties. Theoretical methods are also used in drug screening to rapidly test a wide variety of pharmaceutical candidates and narrow down the most promising targets for future synthesis. However, theoretical approaches have drawbacks: the complex nature of chemical systems is often difficult to model, and there is often no systematic way to determine the accuracy of the predictions, especially when multiple theoretical models exist, or to systematically improve upon them. For theoretical methods to be of use, they need to predict, accurately and routinely, the properties of chemical systems for which experimental values are not known.

Statistical assessment of modeling of proteins and ligands (SAMPL) challenges serve as an excellent platform to assess current and identify new computational modelling approaches for use in drug discovery. The SAMPL challenges are blind competitions focused on predicting a variety of properties, including pK_a, binding affinities, and logP, in the absence of experimental data to ensure no bias in the consideration of computational models [1,2,3,4,5,6,7,8]. The SAMPL6 challenge for the prediction of physical properties was issued in two parts. The first part of the challenge entailed the prediction of aqueous microscopic and macroscopic pK_a values for 24 small drug-like molecules. While a wide range of methods were utilized in the competition, including pure quantum mechanical (QM), molecular mechanical (MM), QM/MM hybrid approaches, and database-based algorithms, several quantum chemical approaches performed better than parameter-dependent methods [1, 9,10,11]. As well, machine learning approaches were utilized, as they are becoming increasingly popular for predictive models [1]. Part II of the SAMPL6 challenge introduces the prediction of water-octanol partition coefficients (logP) for 11 of the 24 molecules from Part I [12].

In this work, two different approaches are used to predict the logP as a part of the SAMPL6 challenge. To build upon knowledge gained from the SAMPL5 distribution-coefficient challenge, electronic structure calculations in combination with continuum solvation models are used to predict the logP and to identify approaches that balance efficiency and quality. Composite approaches are designed to target results possible with a higher level of theory, but at a fraction of the computational cost (memory, disk space, CPU time). The correlation consistent composite approach (ccCA) is an ab initio composite approach that has been utilized for main group thermochemistry and was initially proposed as an alternative to the Gaussian-n composite methods [13, 14]. Over the past several years, variants have been developed and used to target a number of different thermodynamic properties of main group molecules. For example, Solv-ccCA utilizes implicit solvent models to predict the pK_a values of amines and carbon acids within 1 kcal/mol with a direct thermodynamic scheme [15]. In addition, DLPNO-ccCA combines the domain-based local pair natural orbital (DLPNO) methods with ccCA to enable ab initio composite methodologies to reach larger molecules than traditional composite methods can, with a ~95–97% CPU time reduction relative to ccCA [16]. To account for the size of the SAMPL6 molecules and the calculation of partition coefficients between water and 1-octanol, solvation effects have been included within DLPNO-ccCA in this work; the resulting method is denoted DLPNO-Solv-ccCA.

As machine learning techniques are becoming increasingly popular and have demonstrated utility, a machine learning approach is used here to interpolate a relationship between molecules with known logP values and structures similar to those of the challenge molecules. Machine learning approaches infer relationships between a set of known data and a property of interest and then use these relationships to predict the same property for new data where it is not known [17]. As such, machine learning techniques require many data points with known properties for training; however, they can make predictions on new data faster than electronic structure methods. Machine learning techniques in chemistry-related fields often utilize quantitative structure–activity relationships (QSAR) that allow a chemical system to be correlated to a property of interest via a parameter or descriptor [18,19,20]. The range of possible QSAR descriptors is essentially unlimited; descriptors can be as simple as the identities of atoms, bonds, and bond angles in a system, or can include charge, electronic, and thermodynamic properties, among others [21]. Although machine learning algorithms are generally faster than conventional electronic structure methods, they require careful curation of the data used for training [22, 23].

Methods

For the logP challenge, the molecular structures were provided by the organizers; however, the experimentally determined logP values were withheld so that the predictions would not be influenced by them. After completion of the challenge, the experimental results were released, and the predictions from the current study are compared to them. The eleven challenge molecules, taken from the SAMPL6 pK_a challenge, resemble fragments of small protein kinase inhibitors; their octanol-water partition coefficients (logP) were predicted using the issued tautomer state. These molecules are shown in Fig. 1. The molecules are abbreviated as SM02, SM04, SM07, SM08, SM09, SM11, SM12, SM13, SM14, SM15, and SM16 in accordance with their naming scheme during the SAMPL6 pK_a challenge.

Machine learning and QSAR

A total of 97 molecules were selected from the NIST Standard Reference Database [24] of partition coefficients for use in the training models (Table 1). The molecules were chosen based on their structural similarity to the challenge set molecules, e.g., molecules with aromatic rings, heterocyclic rings, ketones, esters, ethers, and alcohols, as well as molecules that contain N, O, F, and Cl. The corresponding recommended logP value was also retrieved for each molecule selected from the NIST database. SMILES strings were retrieved from PubChem [25] and were transformed into 3-dimensional coordinates via Open Babel [26]. Structures were optimized using the B3LYP density functional including Grimme’s D3 dispersion correction with Becke-Johnson damping (B3LYP-D3) [27,28,29] and the correlation consistent polarized valence triple-ζ (cc-pVTZ) basis sets [30].

Table 1 Molecules used in the training data set

For the training model, both 2D and 3D QSAR descriptors were generated using the optimized structures. Initially, 19 molecular descriptors were generated for each molecule in the training and challenge sets (Table 2). The van der Waals volume (VDWV) and area (VDWA) were calculated using the fast calculation approach from Zhao et al. [31]. This approach requires only the number of bonds, the count of each atom type, and the number of aromatic and nonaromatic rings present in each species, all of which are independent of the 3D structure. To incorporate the 3D geometry, quantum chemical descriptors, including HOMO/LUMO energies, dipole moments, superdelocalizabilities, and polarizabilities, were derived from electronic structure calculations using the PBE density functional approximation in both the gas and solvated phases. Approaches utilizing these descriptors are labeled ML-1 and ML-2. To achieve a better correlation, the list of descriptors was expanded to include additional 2D and 3D descriptors such as molar refractivity and surface accessibility (Table S1). All additional QSAR descriptors were generated using MOE 2018 and used for approaches ML-3 and ML-4 [32]. Once descriptors were generated, feature selection and exploratory data analysis (EDA) were performed. Feature selection helps prevent overfitting by removing irrelevant descriptors. EDA, a visual way of searching data for trends, was also utilized. Univariate analysis was used to determine how well each descriptor correlates with the known experimental logP values of the training set: the coefficient of determination, R², was calculated separately for each descriptor against the experimental logP values.
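The univariate screening described above can be sketched in a few lines. This is a minimal illustration, assuming R² is taken as the squared Pearson correlation between one descriptor and logP; the function names, descriptor names, and numbers are hypothetical placeholders, not data from the study.

```python
def r_squared(x, y):
    """Squared Pearson correlation between one descriptor column and logP."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no information
    return (sxy * sxy) / (sxx * syy)


def rank_descriptors(table, logp):
    """Return (name, R^2) pairs sorted from most to least correlated."""
    scores = [(name, r_squared(values, logp)) for name, values in table.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder values for illustration only (not taken from the paper).
toy_logp = [0.5, 1.0, 1.5, 2.0, 2.5]
toy_table = {
    "descriptor_a": [10.0, 20.0, 30.0, 40.0, 50.0],  # exactly linear in logP
    "descriptor_b": [3.0, 1.0, 4.0, 1.0, 5.0],       # only weakly related
}
for name, score in rank_descriptors(toy_table, toy_logp):
    print(f"{name}: R^2 = {score:.3f}")
```

Descriptors at the bottom of such a ranking would be candidates for removal before model building.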

Table 2 Geometric and quantum-chemical molecular descriptors used for machine learning

Electronic structure calculations for vertical solvation and descriptor generation

Correlation consistent polarized valence triple-ζ (cc-pVTZ) basis sets were used for all calculations [30]. The recommended cc-pV(T + d)Z basis set was used for all Cl atoms [33]. All density functional theory (DFT) calculations were performed with Gaussian16 [34]. All challenge molecules were optimized with the B3LYP density functional with Grimme’s D3 dispersion correction with Becke-Johnson damping (B3LYP-D3) [27,28,29]. The B3LYP-D3 functional was chosen since there are numerous conjugated ring structures that may exhibit intramolecular π-π stacking. Also, the use of Grimme’s dispersion correction can alter ligand orientation for systems exhibiting long-range noncovalent interactions. All structures were verified to be local minima via frequency calculations on an ‘ultrafine’ integration grid, and harmonic frequencies and vibrational contributions to the Gibbs free energy were scaled by 0.989 in accordance with previous studies [14]. More sophisticated integration grids were used to optimize the structures of molecules displaying methyl rotations. For SM09 and SM12, HCl counterions were included.

The logP was calculated from single point calculations using six functionals: BLYP [27, 35], B3LYP [27, 28], PBE [36], PBE0 [36, 37], PW91 [38, 39], and B3PW91 [28, 38, 39]. The functionals BLYP, PBE, and PW91 are generalized gradient approximation (GGA) functionals and B3LYP, PBE0, and B3PW91 are hybrid-GGA functionals through the inclusion of exact exchange within the functional. All single point calculations use the Solvent Model for Density, or SMD, implicit solvent model to account for long-range solvent effects of water and 1-octanol on the solute [40].

Given the size of the challenge molecules and previous studies indicating that the percentage of exact exchange within a functional correlates with logP predictions in the SAMPL5 competition [7], the domain-based local pair natural orbital (DLPNO) CCSD(T) method within the ORCA 4.0 program suite was utilized to predict logP [41,42,43]. For the DLPNO calculations, the Coulomb-exchange fitting correlation consistent auxiliary basis set (cc-pVTZ/JK) was used in conjunction with the RIJCOSX approximation [44]. The RIJCOSX approximation uses the resolution-of-the-identity (RI) approximation to calculate the Coulomb integrals and a semi-numerical integration technique to calculate the exchange integrals. Within ORCA, the TightPNO setting was utilized to reduce the number of screened pair natural orbitals (PNOs) from the DLPNO-CCSD(T) calculation. Solvation effects were modeled using the SMD implicit solvent model for water and 1-octanol within ORCA 4.0.

To further expand on the application of the DLPNO methods, a combination of the DLPNO-ccCA [16] and Solv-ccCA [15] ab initio composite strategies, denoted as DLPNO-Solv-ccCA, was used. Outlined in Table 3, the DLPNO-Solv-ccCA approach utilizes the DLPNO methods and the SMD solvation model within the ccCA approach to calculate the free energy of solvation, which can be applied to solve for the logP following Eq. 1. The ccCA methodology has been extensively described in literature [13, 14, 45, 46].

Table 3 A schematic for DLPNO-Solv-ccCA

A brief description of the ccCA methodology includes using Dunning’s augmented correlation consistent basis sets (aug-cc-pVnZ, n = D, T, Q) to extrapolate to the complete basis set (CBS) limit for the RI-HF and DLPNO-MP2 separately. Various extrapolation schemes have been developed to extrapolate electronic energies to the CBS limit, including a three-point extrapolation scheme based on the ζ-level of the basis set and a two-point extrapolation incorporating the maximum angular momentum of the basis set. For this work, an average of Peterson’s three-point extrapolation and the Schwartz two-point extrapolation using triple- and quadruple-zeta level basis sets was used in accordance with previous ccCA studies [14, 16, 47, 48]. Additive corrections are included to account for core-core correlation (ΔCC), core-valence correlation (ΔCV), and scalar relativity (ΔDK) using the DLPNO methods.
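The averaging of a three-point and a two-point extrapolation can be illustrated with a short sketch. The functional forms are not stated in this section, so the code assumes the forms commonly quoted in the ccCA literature: a mixed exponential/Gaussian form, E(n) = E_CBS + B·exp[−(n−1)] + C·exp[−(n−1)²], for the three-point fit and an inverse-cubic form, E(l) = E_CBS + B/l³, for the two-point fit. Function names and the example energies are hypothetical, and the sketch ignores the separate treatment of the reference and correlation energies.

```python
from math import exp


def peterson_cbs(e_dz, e_tz, e_qz):
    """Three-point fit of E(n) = E_cbs + B*exp(-(n-1)) + C*exp(-(n-1)**2), n = 2, 3, 4."""
    a1, a2, a3 = exp(-1.0), exp(-2.0), exp(-3.0)  # exponential terms for n = 2, 3, 4
    g1, g2, g3 = exp(-1.0), exp(-4.0), exp(-9.0)  # Gaussian terms for n = 2, 3, 4
    # Subtract neighbouring equations to eliminate E_cbs, then solve the 2x2 system.
    m11, m12, r1 = a1 - a2, g1 - g2, e_dz - e_tz
    m21, m22, r2 = a2 - a3, g2 - g3, e_tz - e_qz
    det = m11 * m22 - m12 * m21
    b = (r1 * m22 - m12 * r2) / det
    c = (m11 * r2 - r1 * m21) / det
    return e_qz - b * a3 - c * g3


def schwartz_cbs(e_tz, e_qz, l_t=3, l_q=4):
    """Two-point fit of E(l) = E_cbs + B / l**3 from triple- and quadruple-zeta energies."""
    return (l_q**3 * e_qz - l_t**3 * e_tz) / (l_q**3 - l_t**3)


def averaged_cbs(e_dz, e_tz, e_qz):
    """Average of the assumed three-point and two-point CBS estimates."""
    return 0.5 * (peterson_cbs(e_dz, e_tz, e_qz) + schwartz_cbs(e_tz, e_qz))


# Hypothetical energies in hartree, for illustration only.
print(averaged_cbs(-76.2600, -76.3300, -76.3500))
```

Because both fits are linear in their unknowns, each extrapolation has a closed-form solution and no iterative fitting is needed.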

To predict the logP values with ab initio approaches, the logP was calculated with a vertical solvation method following Eq. 1 below

$$ \log P = \log_{10}\left( \frac{[\text{solute}]_{\text{octanol}}}{[\text{solute}]_{\text{water}}} \right) = \frac{\left( \Delta G_{\text{water}} - \Delta G_{\text{octanol}} \right) \log_{10} e}{kT} $$

(1)

where k is Boltzmann’s constant, T is the temperature, e is Euler’s number, and ΔG_water and ΔG_octanol are the free energies of solvation in water and 1-octanol, respectively.
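As a worked illustration of Eq. 1, the sketch below converts a pair of solvation free energies into a logP value. It assumes free energies in kcal/mol, so the molar gas constant R replaces k, and a temperature of 298.15 K, which is not stated in the text; the function name and input values are hypothetical.

```python
from math import e, log10

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K); molar counterpart of k in Eq. 1


def logp_from_solvation(dg_water, dg_octanol, temperature=298.15):
    """logP from solvation free energies in kcal/mol, following Eq. 1.

    The default temperature is an assumption made for this sketch.
    """
    return (dg_water - dg_octanol) * log10(e) / (R_KCAL * temperature)


# Hypothetical solvation free energies (kcal/mol), for illustration only.
print(round(logp_from_solvation(dg_water=-5.0, dg_octanol=-7.0), 2))
```

A solute that is more strongly stabilized in 1-octanol than in water (more negative ΔG_octanol) gives a positive logP; at this temperature, one logP unit corresponds to roughly 1.36 kcal/mol of transfer free energy.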

Results

All methods presented correspond to 12 SAMPL6 submissions (IDs: rs4ns, hsotx, c7t5j, 5t0yn, jc68f, fe8ws, 7gg6s, f3dpg, arw58, 4p2ph, cr3hs, and kxsp3); however, the results presented do not correspond to the predictions submitted to the challenge (see Table S2). For five of the SAMPL6 challenge molecules containing the 4-aminoquinazoline backbone, inconsistencies were identified between the target structure and the calculated structure. These originated from the initial guess of the molecular structure obtained by converting SMILES strings to 3D coordinates and resulted in poorer predictions (averaging 0.4 and 1.5 logP units for ML and QM approaches, respectively). Obtaining the target structure required re-optimization from a better initial guess (see Fig. S1). The structures were corrected post-submission, and the revised results are presented below.

Machine learning models

Each of the 97 molecules in the initial training set had 19 molecular descriptors that were used to determine the response variable, logP. Using a univariate analysis to correlate each descriptor to the experimental logP values, the number of useful descriptors was reduced. Although most of the molecular descriptors had little to no correlation to the logP, some of the results followed trends previously identified by Reddy et al. [49]; specifically, the van der Waals volume and dipole moments were highly correlated to the logP. As shown in Table 4, descriptors such as the superdelocalizability of the HOMO (SHOMO) and nucleophilic superdelocalizability, which were expected to correlate well with logP, were poorly correlated in this training set. Polarizability, a descriptor not included in Reddy’s analysis, had the highest correlation overall, with an R² of 0.44. Molecular weight also had a relatively high correlation to logP (R² = 0.36); however, this descriptor was also highly correlated with the VDWV and VDWA descriptors. As molecular properties calculated in water and octanol, including HOMO and LUMO energies and dipole moment, were very highly correlated to each other, with R² values above 0.98, only descriptors in the water phase were considered. Removing the extraneous descriptors that have little correlation to logP reduced the risk of overfitting in the machine learning model.

Table 4 Correlation between the molecular descriptors to known logP values for the training set

For the first approach, denoted ML-1, a multilinear regression (MLR) model was constructed using the three descriptors that best correlated with logP in the training set: VDWV, polarizability, and the aqueous PBE dipole moment of the molecule. This approach captured roughly 65% of the variance in the training set and had a root mean squared error (RMSE) of 0.62. With five-fold cross-validation (CV), this approach yielded an RMSECV of 0.76. As illustrated in Fig. 2, the descriptors with the highest correlation to known logP values have a linear association for the training set molecules; for the challenge molecules, however, there is less overlap, indicating the need to modify the descriptors or the training data for a better fit.
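A three-descriptor multilinear regression of this kind can be sketched with the normal equations. This is a generic least-squares illustration in plain Python, not the software used in the study; the helper names and the toy rows (standing in for VDWV, polarizability, and dipole moment) are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict_mlr(coef, row):
    """Evaluate the fitted linear model for one descriptor row."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def rmse(pred, obs):
    """Root mean squared error between predictions and observations."""
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)) ** 0.5


# Toy rows with three placeholder descriptors; y is an exact linear function of them.
toy_rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (2, 1, 3), (0, 2, 1)]
toy_y = [1.0 + 0.5 * a - 0.2 * b + 0.1 * c for a, b, c in toy_rows]
coef = fit_mlr(toy_rows, toy_y)
print([round(c, 3) for c in coef])
print(rmse([predict_mlr(coef, r) for r in toy_rows], toy_y))
```

With real descriptors the training RMSE would be nonzero, and comparing it with a cross-validated RMSE indicates how much of the fit is specific to the training molecules.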

In the second approach, denoted ML-2, a more robust model was constructed. Using the predictions from the MLR model, molecules identified as outliers were removed from the training set. An outlier is defined as any molecule whose error in the predicted logP deviates by 3 or more standard deviations of the prediction error across the entire training set. Based on this criterion, seven outliers were removed, leaving 90 molecules in the training set. A partial least squares (PLS) regression model was fit to the curated training data; PLS is a useful technique that can reveal intercorrelation among the descriptors along with structural similarities of the molecules in the training set that aid in optimizing the model [50]. To reduce the dimensionality of the model, two latent variables, which combine correlated descriptors into a common component, were used. The two-component PLS model explained 72% of the variance in the training set, with an RMSE of 0.55 and an RMSECV of 0.65 via five-fold cross-validation, an improvement over the MLR model used in ML-1.
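The three-standard-deviation outlier rule can be sketched as follows. The sketch assumes the deviation is measured from the mean prediction error and that the sample standard deviation is used; neither detail is specified in the text. Molecule labels and values are hypothetical.

```python
from statistics import mean, stdev


def split_outliers(names, predicted, observed, n_sigma=3.0):
    """Separate molecules whose prediction error lies n_sigma or more from the mean error."""
    errors = [p - o for p, o in zip(predicted, observed)]
    centre, spread = mean(errors), stdev(errors)  # assumption: sample standard deviation
    kept, removed = [], []
    for name, err in zip(names, errors):
        if abs(err - centre) >= n_sigma * spread:
            removed.append(name)
        else:
            kept.append(name)
    return kept, removed


# Placeholder data: 19 small errors of +/-0.1 and one large error.
toy_names = [f"mol_{i}" for i in range(20)]
toy_obs = [1.0] * 20
toy_pred = [1.0 + (0.1 if i % 2 == 0 else -0.1) for i in range(19)] + [6.0]
kept, removed = split_outliers(toy_names, toy_pred, toy_obs)
print(len(kept), removed)
```

Note that a single extreme error inflates the standard deviation itself, so with very small data sets no point can ever exceed three standard deviations; the rule is meaningful only for training sets of reasonable size.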

To expand on ML-1 and ML-2, the list of descriptors considered was expanded by 55 additional 2D and 3D descriptors for the 97-molecule training set. Within the expanded set of molecular descriptors, several descriptors correlate highly (R² ≈ 0.8) with other descriptors (Fig. 3). For example, the van der Waals descriptors for area and volume (vdw_area and vdw_vol) correlate with some of the topological indices based on graph theory, including the Wiener (wienerPath, wienerPol) and Zagreb indices. QSAR descriptors for solubility (logS) and molar refractivity (h_mr) highly intercorrelate with descriptors for the partition and distribution coefficients (h_logP and h_logD, respectively), and moderately correlate (R² ≈ 0.5) with various descriptors for the water accessible surface area (ASA, ASA+, and ASA-). Using an approach similar to that used for ML-1, the full list of descriptors was reduced by selecting descriptors that correlated highly with logP within the training set and were not strongly correlated with one another. In the third approach, ML-3, eight descriptors were selected: apol, ASA_H, a_hyd, GCUT_SMR_3, mr, vdw_vol, vsa_hyd, and polarizability. A PLS model with three latent variables was derived, capturing roughly 77% of the variance in the training set and yielding an RMSE of 0.50 and an RMSECV of 0.69.
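How a PLS model with a chosen number of latent variables is built can be illustrated with the textbook NIPALS algorithm for a single response (PLS1). This is a generic sketch, not the implementation used in the study; it mean-centres the descriptors but does not scale them, and all names and data are hypothetical. Libraries such as scikit-learn provide a tested `PLSRegression` class for practical use.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_pls1(rows, y, n_components):
    """Fit PLS1 by NIPALS; returns the means and one (w, p, q) triple per latent variable."""
    n, k = len(rows), len(rows[0])
    x_mean = [sum(r[j] for r in rows) / n for j in range(k)]
    y_mean = sum(y) / n
    x = [[r[j] - x_mean[j] for j in range(k)] for r in rows]  # centred descriptors
    resid = [v - y_mean for v in y]                           # centred response
    comps = []
    for _ in range(n_components):
        w = [_dot([row[j] for row in x], resid) for j in range(k)]  # weights ~ X^T y
        norm = _dot(w, w) ** 0.5
        w = [v / norm for v in w]
        t = [_dot(row, w) for row in x]                             # scores
        tt = _dot(t, t)
        p = [_dot([row[j] for row in x], t) / tt for j in range(k)]  # X loadings
        q = _dot(resid, t) / tt                                      # y loading
        x = [[row[j] - ti * p[j] for j in range(k)] for row, ti in zip(x, t)]  # deflate X
        resid = [v - q * ti for v, ti in zip(resid, t)]                        # deflate y
        comps.append((w, p, q))
    return x_mean, y_mean, comps


def predict_pls1(model, row):
    """Predict the response for one descriptor row by replaying the deflation steps."""
    x_mean, y_mean, comps = model
    x = [v - m for v, m in zip(row, x_mean)]
    y_hat = y_mean
    for w, p, q in comps:
        t = _dot(x, w)
        y_hat += q * t
        x = [v - t * pj for v, pj in zip(x, p)]
    return y_hat


# Toy data: two placeholder descriptors, response exactly linear in both.
toy_x = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
toy_y = [0.3 + 0.7 * a - 0.4 * b for a, b in toy_x]
model = fit_pls1(toy_x, toy_y, n_components=2)
print(round(predict_pls1(model, (7, 2)), 6))
```

When the number of latent variables equals the number of independent descriptors, PLS reproduces ordinary least squares; using fewer components, as in the models above, trades some training-set fit for robustness against collinear descriptors.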

In the fourth approach, ML-4, a total of 74 descriptors were used. A three-component PLS model was derived from this extended list to handle collinearity among the variables and reduce overfitting; it captured roughly 85% of the variance in the training set and yielded an RMSE of 0.41 and an RMSECV of 0.56. By comparing the reduced variables, similarities among the molecules in the training set and test set can be identified. As shown in Fig. 4, the challenge molecules are quite dissimilar to the training molecules. The PLS method was able to identify the collinearity among the descriptors and reduce the overall model complexity.

The logP values for the eleven challenge molecules were predicted using the four machine learning models, which were constructed using different algorithms and molecular descriptors (Table 5). Applied to the challenge molecules, the ML-1 MLR model overestimated the logP, with a mean absolute error (MAE) of 1.07 logP units. Using the RMSECV as the model uncertainty, only five of the challenge molecules were predicted within the model uncertainty. The ML-2 approach, which uses the PLS regression model and the curated training set, performed better than ML-1, with an MAE of 0.97 logP units; however, ML-2 overestimated the logP in all cases except for molecule SM15 and predicted only four molecules within the model uncertainty. With the inclusion of eight descriptors, the ML-3 model underestimated the logP for the challenge molecules, excluding SM13 and SM16, yielding an MAE of 1.01 logP units with respect to experiment. This approach accurately predicted the logP of SM13. Although the MAE is higher than that of ML-2, the ML-3 model predicted the logP within the expected uncertainty for six of the challenge molecules. The best approach, ML-4, had an MAE and RMSE of 0.69 and 0.87 logP units, respectively. This model generally underestimated the logP, by an average of 0.68 logP units with respect to experiment, and predicted the logP values of only five challenge molecules within the model uncertainty. Across the challenge set, SM12, SM15, and SM16 were each predicted within the model uncertainties by three or more approaches, whereas SM07 and SM11 fell outside the model uncertainty of every approach.
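The error statistics quoted above (MAE, RMSE, mean signed error, and the count of molecules predicted within the model uncertainty) can be computed as in the following sketch. It assumes a prediction counts as "within uncertainty" when its absolute error does not exceed the RMSECV; the function name and the numbers are placeholders, not values from Table 5.

```python
def error_summary(predicted, observed, uncertainty):
    """Summarize prediction errors against experiment for one model."""
    errors = [p - o for p, o in zip(predicted, observed)]
    n = len(errors)
    return {
        "mae": sum(abs(err) for err in errors) / n,
        "rmse": (sum(err * err for err in errors) / n) ** 0.5,
        # Negative values indicate that the model underestimates on average.
        "mean_signed_error": sum(errors) / n,
        "within_uncertainty": sum(1 for err in errors if abs(err) <= uncertainty),
    }


# Placeholder predictions and reference values, for illustration only.
summary = error_summary(
    predicted=[2.0, 3.5, 1.0, 4.0],
    observed=[2.5, 3.0, 2.0, 4.0],
    uncertainty=0.56,
)
print(summary)
```

Reporting the mean signed error alongside the MAE makes a systematic over- or underestimation visible, which the MAE alone hides.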

Table 5 Predicted logP values of all machine learning approaches

Electronic structure calculations

Electronic structure methods, although useful in providing a framework for quantum mechanical-based machine learning approaches, have the capacity alone to predict physical properties in the absence of experiment without the need to train large amounts of data focused around targeted applications.

To evaluate the utility of quantum chemical approaches, density-based and wavefunction-based methods were used to predict the logP for the challenge molecules. For DFT approximations, GGA and hybrid-GGA functionals were employed to consider the role of parameterization and exact exchange within density functionals. As shown in Table 6, all six density functional methods considered underestimate the logP. Overall, BLYP was the best functional, with an MAE of 0.77 logP units and a maximum deviation of 1.51 logP units. Among the GGA functionals, the PBE and PW91 functionals performed similarly, having MAEs of 0.80 and 0.81 logP units, respectively. Comparing the hybrid-GGA functionals, B3LYP predicts the logP better than B3PW91 and PBE0. Regarding the contribution of exact exchange, adding exact exchange to the respective GGA functional results in further underestimation of the logP, ranging from 0.05 to 0.27 logP units.

Table 6 Predicted logP values with various density functionals with the cc-pVTZ basis set

Taking a more aggressive approach, linear scaling wavefunction based methods were used to examine the efficacy of ab initio methodologies for predicting the logP and expand the applications of ab initio methodologies towards molecules of increasing size and complexity. Initially, DLPNO-CCSD(T) was used with a triple-ζ level basis set to predict the transfer free energies to estimate the logP (Table 7). This method underestimated the logP for most of the challenge molecules, excluding SM09 and SM14, resulting in an MAE of 0.50 logP units.

Table 7 Predicted logP values determined with DLPNO-CCSD(T) and DLPNO-Solv-ccCA

While CCSD(T) is known for predicting energetic properties of small organic molecules typically within 1.0 kcal/mol (about 0.73 logP units at 298 K, as one logP unit corresponds to roughly 1.36 kcal/mol), composite methods, in particular ccCA, have been shown to reach similar accuracies, and in some cases achieve even better results than CCSD(T)/complete basis set limit calculations, but at reduced computational cost [51]. At the time of the challenge, as the experimental results were unavailable, the DLPNO-Solv-ccCA composite strategy was used with the intent of predicting logP values with a targeted accuracy equivalent to that of DLPNO-CCSD(T)/aug-cc-pCV∞Z-DK. Similar to the results obtained by DLPNO-CCSD(T), the composite method underestimated the logP for most of the challenge molecules (MAE = 0.51 logP units), with the exception of SM09 and SM14, for which the composite approach was 0.03 logP units closer to the experimental values than the DLPNO-CCSD(T) method.

Discussion

Machine learning

Machine learning is popular in modern-day computing and has the potential to be very useful for understanding complex chemical phenomena; however, its learning ability is limited by the quality and quantity of the data. For the purposes of this challenge, logP reference values from the NIST Chemistry WebBook were used to establish a relationship between the structures of molecules similar to the challenge set and their logP values, and the relationship was then transferred to the challenge molecules. Ideally, the training data set should be very similar to the set to which the model is applied. Unfortunately, this was not the case for this investigation because the challenge molecules were larger than the molecules available from the database and contained more complex heterocycles (see Tables S3 and S4).

For the generation of quantum chemical molecular descriptors, choosing an inexpensive yet useful electronic structure method can be challenging. The functionals employed in this study were intentionally selected in pairs from the same developer family, i.e., PBE and PBE0, BLYP and B3LYP, and PW91 and B3PW91, to maintain consistency among parameters within each pair. Additionally, these functionals were chosen due to their ability to predict HOMO-LUMO gaps, one of the quantum chemical molecular descriptors, for a set of small main group molecules [52]. Transferring this concept to machine learning allows for greater transparency as to which molecular descriptor is more correlated with the desired property, so that empirical parameters from significantly different functional types will not bias the outcome. Building upon this idea, PBE and PBE0 were chosen as the methods for determining quantum chemical descriptors for the machine learning models because no empirical parameters were used to develop these functionals. This allows the machine learning model to better gauge trends between different molecular descriptors.

The predictions for the challenge molecules appeared to improve as descriptors were added, illustrating that the descriptors were accounting for molecular properties that correlated to logP, and that the training set molecules were sufficiently structurally relevant to the challenge set molecules to permit the machine learning algorithms to identify useful patterns. As shown in Fig. 5, for the ML-1 model, the predictions for the challenge molecules were biased toward more hydrophobic values with respect to experiment. Using the same molecular descriptors and removing outlier molecules from the training set did not result in a significant improvement for the challenge molecules. In contrast, while using additional descriptors improved the correlation between predicted and experimental logP values for the training set, selecting only a small subset of descriptors, as done in the ML-3 model, shifted the predictions for the challenge molecules to be more hydrophilic. Using many molecular descriptors adds to the flexibility of the model, and the best predictions were obtained using the ML-4 model. As shown in Fig. 6, several descriptors used in ML-4 contribute to the improvement of the model, including logP(o/w), h_logD, h_logP, h_logS, logS, a_hyd, and vsa_hyd, as these descriptors are highly correlated to the experimental logP for the training set of molecules.

Several common machine learning techniques were used for the challenge, including principal component regression (PCR), partial least squares (PLS), and support vector machines (SVM) [50, 53,54,55]. Both PCR and PLS are linear methods that group descriptors into orthogonal components and reduce overfitting by removing excess components. SVM is an alternative, generally non-linear method that draws boundaries to best fit data. To help prevent overfitting, five-fold cross-validation was used to validate the models. In this cross-validation approach, the data are randomly split into five roughly equal subsets; in each iteration, four subsets (80% of the data) are used to build a model that predicts the remaining 20%, giving an estimate of how the model would perform on data not used for training [21]. Additionally, the leave-one-out (LOO) method for cross-validation was considered to detect outliers, as it is a more exhaustive technique than five-fold cross-validation. In this approach, one molecule of the full training set is held out for testing while the remaining molecules are used to build the model, and the logP and RMSE are calculated for each iteration. Outliers were identified as molecules whose prediction error was greater than 3 times the standard deviation of the error. The LOO method identified the seven molecules previously determined as outliers via five-fold cross-validation, along with two additional molecules. Results for all methodologies were fairly similar, which suggests that the models were converging to a similar result.
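The five-fold and leave-one-out procedures differ only in how the data are partitioned, which the following sketch makes explicit. It assumes a single pooled RMSE over all held-out predictions (averaging per-fold RMSEs is an equally common convention) and uses a one-descriptor straight-line fit as a stand-in model; all names and data are hypothetical.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle the sample indices once and deal them into roughly equal folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::n_folds] for i in range(n_folds)]


def cross_validated_rmse(x, y, folds, fit, predict):
    """Pooled RMSE over held-out predictions; each fold is predicted by a model built without it."""
    squared_error, count = 0.0, 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            squared_error += (predict(model, x[i]) - y[i]) ** 2
            count += 1
    return (squared_error / count) ** 0.5


def fit_line(xs, ys):
    """Stand-in model: least-squares line through one descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    return slope, my - slope * mx


def predict_line(model, value):
    slope, intercept = model
    return slope * value + intercept


# Placeholder data: 20 points on an exact line, so both estimates are near zero.
toy_x = [float(i) for i in range(20)]
toy_y = [2.0 * v + 1.0 for v in toy_x]
five_fold = kfold_indices(len(toy_x), n_folds=5)
leave_one_out = kfold_indices(len(toy_x), n_folds=len(toy_x))  # LOO: one sample per fold
print(cross_validated_rmse(toy_x, toy_y, five_fold, fit_line, predict_line))
print(cross_validated_rmse(toy_x, toy_y, leave_one_out, fit_line, predict_line))
```

Leave-one-out requires one model fit per molecule rather than five fits in total, which is the sense in which it is the more exhaustive of the two.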

Comparing the relative computational cost of the quantum chemical methods used, the DFT methods scale approximately as N³–N⁴, depending on the complexity of the functional, whereas the linear scaling wavefunction based methods scale as N⁵, in which N refers to the system’s size (the number of basis functions). In the traditional ccCA method, determining the higher-order electron correlation contributions is the most time-consuming step, scaling as N⁷; however, DLPNO-Solv-ccCA scales as N⁵ because of the reduced number of orbital triples, which arises from having three electron pairs survive the electron pair screening process, included in the DLPNO-CCSD(T) calculations [56, 57]. Given the size of the molecules in both the training and challenge sets, it is possible to expand the quantity and quality of quantum chemical descriptors to improve the predictive ability of the machine learning models. As the optimization of the molecular structure is necessary to generate 3D-descriptors, the protocol could be augmented by employing semiempirical or more approximate methods to obtain the initial guess of the molecular structure, allowing larger data sets to be used in the training of machine learning models.
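As a rough worked example of what these formal scaling exponents imply, the sketch below computes how much the cost grows when the number of basis functions doubles. It considers only the exponents quoted above; prefactors, screening thresholds, and hardware effects, which dominate in practice for molecules of this size, are ignored. The helper `cost_ratio` is hypothetical.

```python
def cost_ratio(size_factor, exponent):
    """Factor by which the formal cost grows when N is multiplied by
    size_factor for a method scaling as N**exponent."""
    return size_factor ** exponent


# Exponents taken from the text; labels are for illustration only.
for label, p in [
    ("DFT, lower bound (N^3)", 3),
    ("DFT, upper bound (N^4)", 4),
    ("DLPNO-Solv-ccCA (N^5)", 5),
    ("traditional ccCA (N^7)", 7),
]:
    print(f"{label}: doubling N multiplies the cost by {cost_ratio(2, p)}")
```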

Observing trends among the four machine learning models against the challenge set, adjustments to each model resulted in a significant improvement for most challenge molecules. SM11 was a notable outlier and was poorly predicted in each model. In contrast to other challenge molecules, the predicted van der Waals volume as well as other descriptors were separated from the median and are believed to be the source of error in the machine learning predictions. Furthermore, six of the challenge molecules (SM02, SM04, SM07, SM09, SM12, SM13) represent the 4-amino quinazoline scaffold present in protein kinase inhibitors used as anticancer agents [50], and differ primarily in the substituents on the 4-amino group. Ranking the logP of these 4-amino quinazoline derivatives, the four machine learning models were able to properly rank some of the molecules (Table 8). The ML-1 and ML-2 models properly ranked SM09 and were close (within one rank) to properly ranking SM12, SM04, and SM02. The ranking of the six molecules is in fact identical for both methods. As the only difference between the two methods is the algorithm used, this highlights that the descriptors carry great weight in determining these substituent effects. The ML-3 model correctly ranked SM07 and SM02 and was close to properly ranking SM09 and SM04. Oddly, the best model with the lowest MAE, ML-4, failed to correctly rank the six derivatives but was able to predict SM13 and SM02 within one ranking unit. These trends reveal that some descriptors, or possibly the use of many descriptors, increase the overall resolution for machine learning but decrease the resolution for contrasting substituent effects.
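The ranking comparison discussed above can be expressed compactly. The sketch below assumes that a molecule is "correctly ranked" when its predicted and experimental rank positions coincide and "close" when they differ by exactly one position; ties are not handled. The molecule labels and logP values are placeholders, not data from Table 8, and the helper names are hypothetical.

```python
def ranks(values):
    """Map each molecule to its rank position (1 = smallest logP)."""
    ordered = sorted(values, key=values.get)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def compare_rankings(predicted, experimental):
    """Return (exactly ranked, ranked within one position) molecule lists."""
    pred_rank, exp_rank = ranks(predicted), ranks(experimental)
    exact = sorted(m for m in exp_rank if pred_rank[m] == exp_rank[m])
    close = sorted(m for m in exp_rank if abs(pred_rank[m] - exp_rank[m]) == 1)
    return exact, close


# Placeholder values for six hypothetical derivatives (not from the paper).
experimental = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 5.0, "F": 6.0}
predicted = {"A": 1.2, "B": 3.5, "C": 2.9, "D": 4.1, "E": 6.5, "F": 4.8}

exact, close = compare_rankings(predicted, experimental)
print("correct rank:", exact)
print("within one rank:", close)
```

A rank-based comparison of this kind is insensitive to a systematic offset in the predicted values, which is why a model with a lower MAE does not necessarily rank a congeneric series better.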

Table 8 Ranking the logP predictions in order from smallest to largest logP for the molecules with the 4-amino quinazoline scaffold

Electronic structure

All density functional approaches underestimated the logP values [as indicated by the negative sign for the mean signed error (MSE)] and resulted in equal magnitudes between the MSE and MAE. Functional choice was a factor in logP prediction. The GGA functionals (BLYP, PBE, and PW91) yielded a lower MSE, MAE, and RMSE than the hybrid-GGA functionals (B3LYP, PBE0, and B3PW91) by approximately 0.10 to 0.13 logP units. This implies that the inclusion of exact exchange does not benefit the energetics for the prediction of water-octanol partition coefficients. The BLYP functional yielded the highest logP values, which correspond to the lowest MAE and RMSE among the examined functionals. Across all functionals, SM07 yielded the largest error in the logP of approximately 1.61 logP units and SM09 yielded the smallest error of approximately 0.09 logP units.
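The observation that the MSE and MAE have equal magnitudes follows directly from every error having the same sign. The sketch below illustrates this with placeholder values that are not taken from the study; note that MSE here denotes the mean signed error, as in the text, not the mean squared error. The function name `error_metrics` is hypothetical.

```python
def error_metrics(predicted, reference):
    """Return (mean signed error, mean absolute error, RMSE)."""
    errors = [p - r for p, r in zip(predicted, reference)]
    n = len(errors)
    mse = sum(errors) / n                        # mean signed error
    mae = sum(abs(e) for e in errors) / n        # mean absolute error
    rmse = (sum(e * e for e in errors) / n) ** 0.5
    return mse, mae, rmse


reference = [2.0, 3.0, 4.0]          # placeholder "experimental" logP values

# Every prediction is too low: |MSE| equals MAE.
all_low = error_metrics([1.5, 2.0, 3.75], reference)

# Mixed signs: the signed errors partly cancel, so |MSE| < MAE.
mixed = error_metrics([2.5, 2.0, 3.75], reference)

print("all underestimated:", all_low)
print("mixed signs:", mixed)
```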

In considering more rigorous methods known to predict properties within 1.0 kcal/mol of experiment for small organic molecules, a question is whether the increase in computational expense has an important impact on the prediction of partition coefficients of drug-like molecules. For this SAMPL challenge, there is a useful impact. The DLPNO-CCSD(T) and DLPNO-Solv-ccCA methods were among the best performing methods examined in this study. Overall, both methods underestimated logP, except for SM09 and SM14. In general, DLPNO-Solv-ccCA yielded lower logP values than DLPNO-CCSD(T) by less than 0.06 logP units for all molecules except SM08, for which DLPNO-Solv-ccCA increased the predicted logP by 0.02 logP units. DLPNO-Solv-ccCA includes DLPNO-MP2 calculations, which contributed to the lowering of the predicted logP relative to DLPNO-CCSD(T).

For DLPNO-Solv-ccCA, the most significant contribution to the predicted logP value was the core-core correlation contribution (ΔCC) (Table 9). In comparing the DLPNO-CCSD(T)/cc-pVTZ logP values to those obtained with DLPNO-MP2/CBS + ΔCC, which effectively models DLPNO-CCSD(T)/CBS, the additional contribution reduced the predicted logP values by 0.02 to 0.08 logP units, with the exception of SM15, for which the logP value increased by 0.01 logP units. The largest difference in magnitude between DLPNO-CCSD(T)/cc-pVTZ and DLPNO-MP2/CBS + ΔCC logP values is for SM09 and SM13 at 0.08 logP units. This implies that including a larger basis set would decrease the predicted logP value without having to calculate logP values at the DLPNO-CCSD(T)/aug-cc-pVQZ level.

Table 9 Decomposition of DLPNO-Solv-ccCA predicted logP values

The core-valence contribution (ΔCV) was the next largest contribution as the logP values changed by less than 0.05 logP units. This is most likely due to the small number of additional electrons from the core included explicitly in the calculation with only the 1s shell for first-row atoms and 2s and 2p shells for chlorine atoms. The largest change in logP value from the inclusion of the core-valence contribution to the logP value for DLPNO-MP2/CBS + ΔCC was an increase by 0.05 logP units for SM09. As well, the predicted logP value decreased when including the core-valence contribution for SM07, SM14, and SM15, by 0.01, 0.01, and 0.02 logP units, respectively. This shows that there is no general trend with the inclusion of the sub-valence electrons in the logP calculation. For relativistic corrections, the predicted logP values were altered by only ≤ 0.01 logP units relative to logP values generated with DLPNO-MP2/CBS + ΔCC + ΔCV energies. Since logP is a relative quantity as shown in Eq. 1, this essentially negates the effect of relativity in solvent phase. Overall, the core-valence and relativistic contributions obtained at the DLPNO-MP2 level resulted in a small change (0.05 logP units) in the predicted logP values relative to DLPNO-CCSD(T)/cc-pVTZ.
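The near cancellation of a correction that is almost identical in both solvents can be illustrated with a short sketch. It assumes that logP is obtained from the difference of the solute free energies in water and 1-octanol divided by RT ln 10 (the usual transfer free energy relation; the exact form of Eq. 1 is not reproduced here), a temperature of 298.15 K, and an additive composite scheme. All energy values are hypothetical and chosen only to show the bookkeeping; the labels `dCC`, `dCV`, and `dSR` stand for the ΔCC, ΔCV, and scalar relativistic terms.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal/(mol K)
T = 298.15             # assumed temperature in K


def log_p(g_water, g_octanol, temperature=T):
    """logP from solute free energies (kcal/mol) in water and octanol."""
    return (g_water - g_octanol) / (R_KCAL * temperature * math.log(10))


# Hypothetical additive contributions in kcal/mol (not from Table 9),
# given relative to an arbitrary common reference.
water = {"MP2/CBS": -10.00, "dCC": 0.30, "dCV": -0.05, "dSR": -0.200}
octanol = {"MP2/CBS": -12.00, "dCC": 0.35, "dCV": -0.08, "dSR": -0.201}

running_w = running_o = 0.0
steps = []
for term in ("MP2/CBS", "dCC", "dCV", "dSR"):
    running_w += water[term]
    running_o += octanol[term]
    steps.append((term, log_p(running_w, running_o)))

for term, value in steps:
    print(f"after adding {term}: logP = {value:.3f}")
```

Because only the difference between the two phases enters, a term such as the relativistic correction changes logP only through the small difference between its values in water and in octanol.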

Revisiting the 4-amino quinazoline derivatives, the DLPNO-CCSD(T) method did not correctly rank any of the six molecules but was close to ranking four of the six molecules (Table 8). With the core-core correlation correction (ΔCC) included in the composite approach, the trends are nearly identical to the DLPNO-CCSD(T) predictions with the exception of SM04, for which DLPNO-Solv-ccCA predicts the correct rank. Notable outliers for which the electronic structure methods had the most difficulty in determining logP values are SM02 and SM07. The error in the logP for both molecules was larger than for the other molecules within the challenge set. The structure of SM02 differs from the other molecules in that it has a trifluoromethyl phenyl group attached to the 4-amino quinazoline. As modeled, the challenge molecules are generally planar. Considering various low-energy conformations may result in improvements to logP predictions. The treatment of electrostatics in solution has an impact on the prediction. Including the HCl counterion was significant in the logP prediction for both SM09 and SM12. For example, removing the counterion for SM09 decreases the predicted value by 1.4 logP units using the DLPNO-CCSD(T) method. Incorporating the ionic strength of the solution may be a useful consideration for future improvement.
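To put the counterion effect in energetic terms, a shift in logP can be converted to a shift in transfer free energy by multiplying by RT ln 10. The sketch below assumes a temperature of 298.15 K, which is not stated in this section; the helper `logp_to_kcal` is hypothetical.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal/(mol K)


def logp_to_kcal(delta_logp, temperature=298.15):
    """Magnitude of the transfer free energy change (kcal/mol) that
    corresponds to a change of delta_logp in logP."""
    return delta_logp * R_KCAL * temperature * math.log(10)


per_unit = logp_to_kcal(1.0)     # free energy per logP unit
shift = logp_to_kcal(1.4)        # the 1.4 logP unit change quoted for SM09
print(f"1 logP unit ~ {per_unit:.2f} kcal/mol")
print(f"1.4 logP units ~ {shift:.2f} kcal/mol")
```

Under this assumption, one logP unit corresponds to roughly 1.36 kcal/mol, so the 1.4 logP unit change is on the order of 1.9 kcal/mol in the water to octanol transfer free energy.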

Conclusion

For the electronic structure methods, the GGA functionals (BLYP, PBE, and PW91) yielded lower MAEs than the hybrid-GGA functionals (B3LYP, PBE0, and B3PW91), with BLYP yielding the lowest MAE among the density functionals used. DLPNO-CCSD(T) and DLPNO-Solv-ccCA yielded MAEs of 0.50 and 0.51 logP units, respectively, the lowest among all methods examined in this work. The core-valence and scalar relativistic corrections altered the DLPNO-Solv-ccCA logP values by less than 0.05 logP units.

The machine learning approaches tended to yield favorable results, with an overall MAE of approximately 1 logP unit, but were sensitive to the choice of data used in the training set and to the choice of descriptors. This is generally one of the drawbacks of machine learning algorithms, as there are often no clear systematic ways to improve the accuracy beyond adding more data to the model, which may not be readily available. Overall, the machine learning approaches showed potential for further development toward logP prediction with a relatively small training set.

References

Bannan CC, Mobley DL, Skillman AG (2018) SAMPL6 challenge results from pKa predictions based on a general Gaussian process model. J Comput Aided Mol Des 32:1165–1177. https://doi.org/10.1007/s10822-018-0169-z
Nicholls A, Wlodek S, Grant JA (2009) The SAMP1 solvation challenge: further lessons regarding the pitfalls of parametrization. J Phys Chem B 113:4521–4532. https://doi.org/10.1021/jp806855q
Geballe MT, Skillman AG, Nicholls A et al (2010) The SAMPL2 blind prediction challenge: introduction and overview. J Comput Aided Mol Des 24:259–279. https://doi.org/10.1007/s10822-010-9350-8
Geballe MT, Guthrie JP (2012) The SAMPL3 blind prediction challenge: transfer energy overview. J Comput Aided Mol Des 26:489–496. https://doi.org/10.1007/s10822-012-9568-8
Muddana HS, Fenley AT, Mobley DL, Gilson MK (2014) The SAMPL4 host–guest blind prediction challenge: an overview. J Comput Aided Mol Des 28:305–317. https://doi.org/10.1007/s10822-014-9735-1
Yin J, Henriksen NM, Slochower DR et al (2017) Overview of the SAMPL5 host–guest challenge: are we doing better? J Comput Aided Mol Des 31:1–19. https://doi.org/10.1007/s10822-016-9974-4
Jones MR, Brooks BR, Wilson AK (2016) Partition coefficients for the SAMPL5 challenge using transfer free energies. J Comput Aided Mol Des 30:1129–1138. https://doi.org/10.1007/s10822-016-9964-6
Rizzi A, Murkli S, McNeill JN et al (2018) Overview of the SAMPL6 host–guest binding affinity prediction challenge. J Comput Aided Mol Des 32:937–963. https://doi.org/10.1007/s10822-018-0170-6
Klamt A, Schüürmann G (1993) COSMO: a new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. J Chem Soc Perkin Trans 2:799–805. https://doi.org/10.1039/P29930000799
Tielker N, Eberlein L, Güssregen S, Kast SM (2018) The SAMPL6 challenge on predicting aqueous pKa values from EC-RISM theory. J Comput Aided Mol Des 32:1151–1163. https://doi.org/10.1007/s10822-018-0140-z
Zeng Q, Jones MR, Brooks BR (2018) Absolute and relative pKa predictions via a DFT approach applied to the SAMPL6 blind challenge. J Comput Aided Mol Des 32:1179–1189. https://doi.org/10.1007/s10822-018-0150-x
Işık M, Levorse D, Mobley DL et al (2019) Octanol-water partition coefficient measurements for the SAMPL6 blind prediction challenge. J Comput Aided Mol Des. https://doi.org/10.1007/s10822-019-00271-3
DeYonker NJ, Cundari TR, Wilson AK (2006) The correlation consistent composite approach (ccCA): an alternative to the Gaussian-n methods. J Chem Phys 124:114104. https://doi.org/10.1063/1.2173988
DeYonker NJ, Wilson BR, Pierpont AW et al (2009) Towards the intrinsic error of the correlation consistent composite approach (ccCA). Mol Phys 107:1107–1121
Riojas AG, Wilson AK (2014) Solv-ccCA: implicit solvation and the correlation consistent composite approach for the determination of pK_a. J Chem Theory Comput 10:1500–1510. https://doi.org/10.1021/ct400908z
Patel P, Wilson AK (2020) Utilization of the domain-based local pair natural orbital methods within the correlation consistent composite approach. J Comput Chem. https://doi.org/10.1002/jcc.26129
Lo Y-C, Rensi SE, Torng W, Altman RB (2018) Machine learning in chemoinformatics and drug discovery. Drug Discov Today 23:1538–1546. https://doi.org/10.1016/j.drudis.2018.05.010
Alves de Lima Ribeiro F, Ferreira MMC (2003) QSPR models of boiling point, octanol–water partition coefficient and retention time index of polycyclic aromatic hydrocarbons. J Mol Struct THEOCHEM 663:109–126. https://doi.org/10.1016/j.theochem.2003.08.107
Nantasenamat C, Isarankura-Na-Ayudhya C, Naenna T, Prachayasittikul V (2009) A practical overview of quantitative structure-activity relationship. EXCLI J 8:74–88
Yousefinejad S, Hemmateenejad B (2015) Chemometrics tools in QSAR/QSPR studies: a historical perspective. Chemom Intell Lab Syst 149:177–204. https://doi.org/10.1016/j.chemolab.2015.06.016
Thanikaivelan P, Subramanian V, Raghava Rao J, Unni Nair B (2000) Application of quantum chemical descriptor in quantitative structure activity and structure property relationship. Chem Phys Lett 323:59–70. https://doi.org/10.1016/S0009-2614(00)00488-7
Ochi S, Miyao T, Funatsu K (2017) Structure modification toward applicability domain of a QSAR/QSPR model considering activity/property. Mol Inform 36:1700076. https://doi.org/10.1002/minf.201700076
Welborn M, Cheng L, Miller TF (2018) Transferability in machine learning for electronic structure via the molecular orbital basis. J Chem Theory Comput 14:4772–4779. https://doi.org/10.1021/acs.jctc.8b00636
Sangster J (1989) Octanol-water partition coefficients of simple organic compounds. J Phys Chem Ref Data 18:1111–1229. https://doi.org/10.1063/1.555833
Kim S, Chen J, Cheng T et al (2019) PubChem 2019 update: improved access to chemical data. Nucleic Acids Res 47:D1102–D1109. https://doi.org/10.1093/nar/gky1033
O’Boyle NM, Banck M, James CA et al (2011) Open Babel: an open chemical toolbox. J Cheminform 3:33. https://doi.org/10.1186/1758-2946-3-33
Lee C, Yang W, Parr RG (1988) Development of the Colle-Salvetti correlation-energy formula into a functional of the electron density. Phys Rev B 37:785–789. https://doi.org/10.1103/PhysRevB.37.785
Becke AD (1993) Density-functional thermochemistry. III. The role of exact exchange. J Chem Phys 98:5648–5652. https://doi.org/10.1063/1.464913
Grimme S, Antony J, Ehrlich S, Krieg H (2010) A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu. J Chem Phys 132:154104. https://doi.org/10.1063/1.3382344
Kendall RA, Dunning TH Jr, Harrison RJ (1992) Electron affinities of the first-row atoms revisited. Systematic basis sets and wave functions. J Chem Phys 96:6796–6806. https://doi.org/10.1063/1.462569
Zhao YH, Abraham MH, Zissimos AM (2003) Fast calculation of van der Waals volume as a sum of atomic and bond contributions and its application to drug compounds. J Org Chem 68:7368–7373
Molecular Operating Environment (MOE) (2018) 2018.01. Chemical Computing Group Inc., Montreal
Dunning TH, Peterson KA, Wilson AK (2001) Gaussian basis sets for use in correlated molecular calculations. X. The atoms aluminum through argon revisited. J Chem Phys 114:9244. https://doi.org/10.1063/1.1367373
Frisch MJ, Trucks GW, Schlegel HB et al (2016) Gaussian 16, revision A.03. Gaussian Inc., Wallingford
Becke AD (1988) Density-functional exchange-energy approximation with correct asymptotic behavior. Phys Rev A 38:3098–3100. https://doi.org/10.1103/PhysRevA.38.3098
Perdew JP, Burke K, Ernzerhof M (1996) Generalized gradient approximation made simple. Phys Rev Lett 77:3865–3868. https://doi.org/10.1103/PhysRevLett.77.3865
Ernzerhof M, Scuseria GE (1999) Assessment of the Perdew-Burke-Ernzerhof exchange-correlation functional. J Chem Phys 110:5029–5036. https://doi.org/10.1063/1.478401
Perdew JP, Chevary JA, Vosko SH et al (1992) Atoms, molecules, solids, and surfaces: applications of the generalized gradient approximation for exchange and correlation. Phys Rev B 46:6671–6687. https://doi.org/10.1103/PhysRevB.46.6671
Perdew JP, Wang Y (1992) Accurate and simple analytic representation of the electron-gas correlation energy. Phys Rev B 45:13244–13249. https://doi.org/10.1103/PhysRevB.45.13244
Marenich AV, Cramer CJ, Truhlar DG (2009) Universal solvation model based on solute electron density and on a continuum model of the solvent defined by the bulk dielectric constant and atomic surface tensions. J Phys Chem B 113:6378–6396. https://doi.org/10.1021/jp810292n
Pinski P, Riplinger C, Valeev EF, Neese F (2015) Sparse maps—a systematic infrastructure for reduced-scaling electronic structure methods. I. An efficient and simple linear scaling local MP2 method that uses an intermediate basis of pair natural orbitals. J Chem Phys 143:34108. https://doi.org/10.1063/1.4926879
Pavošević F, Peng C, Pinski P et al (2017) SparseMaps—a systematic infrastructure for reduced scaling electronic structure methods. V. Linear scaling explicitly correlated coupled-cluster method with pair natural orbitals. J Chem Phys https://doi.org/10.1063/1.4979993
Neese F (2018) Software update: the ORCA program system, version 4.0. Wiley Interdiscip Rev Comput Mol Sci 8:e1327. https://doi.org/10.1002/wcms.1327
Neese F, Wennmohs F, Hansen A, Becker U (2009) Efficient, approximate and parallel Hartree–Fock and hybrid DFT calculations. A “chain-of-spheres” algorithm for the Hartree–Fock exchange. Chem Phys 356:98–109. https://doi.org/10.1016/j.chemphys.2008.10.036
Laury ML, DeYonker NJ, Jiang W, Wilson AK (2011) A pseudopotential-based composite method: the relativistic pseudopotential correlation consistent composite approach for molecules containing 4d transition metals (Y-Cd). J Chem Phys 135:214103–214110.
Jiang W, DeYonker NJ, Determan JJ, Wilson AK (2012) Toward accurate thermochemistry of first row transition metal complexes. J Phys Chem A 116:870
Jorgensen KR, Wilson AK (2012) Enthalpies of formation for organosulfur compounds: atomization energy and hypohomodesmotic reaction schemes via ab initio composite methods. Comput Theor Chem 991:1–12. https://doi.org/10.1016/j.comptc.2012.03.003
Alsunaidi ZHA, Wilson AK (2016) DFT and ab initio composite methods: investigation of oxygen fluoride species. Comput Theor Chem 1095:71–82. https://doi.org/10.1016/j.comptc.2016.08.024
Reddy KN, Locke MA (1996) Molecular properties as descriptors of octanol-water partition coefficients of herbicides. Water Air Soil Pollut 86:389–405. https://doi.org/10.1007/BF00279169
Wold S, Sjöström M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst 58:109–130. https://doi.org/10.1016/S0169-7439(01)00155-1
Weber R, Wilson AK (2015) Do composite methods achieve their target accuracy? Comput Theor Chem 1072:58–62. https://doi.org/10.1016/j.comptc.2015.08.015
Zhang G, Musgrave CB (2007) Comparison of DFT methods for molecular orbital eigenvalue calculations. J Phys Chem A 111:1554–1561. https://doi.org/10.1021/jp061633o
Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2:37–52. https://doi.org/10.1016/0169-7439(87)80084-9
Karlen Y, McNair A, Perseguers S et al (2007) Statistical significance of quantitative PCR. BMC Bioinformatics 8:131. https://doi.org/10.1186/1471-2105-8-131
Suykens JAK, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9:293–300. https://doi.org/10.1023/A:1018628609742
Riplinger C, Sandhoefer B, Hansen A, Neese F (2013) Natural triple excitations in local coupled cluster calculations with pair natural orbitals. J Chem Phys 139:134101. https://doi.org/10.1063/1.4821834
Riplinger C, Pinski P, Becker U et al (2016) Sparse maps—a systematic infrastructure for reduced-scaling electronic structure methods. II. Linear scaling domain based pair natural orbital coupled cluster theory. J Chem Phys. https://doi.org/10.1063/1.4939030

Acknowledgements

The authors gratefully acknowledge the University of North Texas Academic Computing Services for the use of the UNT Research Clusters. Computational resources were provided via the NSF Major Research Instrumentation program supported by the National Science Foundation under Grant No. CHE-1531468. This research was supported in part by the Intramural Research Program of the National Heart, Lung, and Blood Institute of the National Institutes of Health and utilized the high-performance computational capabilities of the LoBoS and Biowulf Linux clusters at the National Institutes of Health (http://www.lobos.nih.gov and http://biowulf.nih.gov).

Author information

Authors and Affiliations

Department of Chemistry, Michigan State University, East Lansing, MI, 48824-1322, USA
Prajay Patel & Angela K. Wilson
Department of Chemistry and Center for Advanced Scientific Computing and Modeling (CASCaM), University of North Texas, Denton, TX, 76203-5070, USA
David M. Kuntz & Angela K. Wilson
Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, 20852-5690, USA
Michael R. Jones & Bernard R. Brooks

Corresponding author

Correspondence to Angela K. Wilson.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 96.5 kb)

About this article

Cite this article

Patel, P., Kuntz, D.M., Jones, M.R. et al. SAMPL6 logP challenge: machine learning and quantum mechanical approaches. J Comput Aided Mol Des 34, 495–510 (2020). https://doi.org/10.1007/s10822-020-00287-0

Received: 16 October 2019
Accepted: 08 January 2020
Published: 30 January 2020
Issue Date: May 2020
DOI: https://doi.org/10.1007/s10822-020-00287-0

Introduction