Introduction

A partition coefficient (P) is the ratio of the concentrations of an un-ionized solute in two immiscible phases. If one of the phases is water, then the partition coefficient can be considered a measure of the hydrophobicity or hydrophilicity of a substance. The coefficient, often expressed as the logarithm of this ratio (logP), is a useful estimator of how pharmaceutical drugs are absorbed.

While partition coefficients have been determined experimentally for decades, theoretical approaches toward calculating physicochemical properties such as solubility and ionization are increasingly being used in conjunction with experiment to aid in expediting the determination of properties. Theoretical methods are used in drug screening as well, to rapidly test a wide variety of pharmaceutical candidates and narrow down the most promising targets for future synthesis. However, theoretical approaches have drawbacks as well, notably that the complex nature of chemical systems is often difficult to model and that there are often no systematic ways to determine the accuracies of the predictions, especially when multiple models for theoretical predictions exist, nor systematically improve upon the predictions. For theoretical methods to be of use, they need to be able to accurately (and routinely) predict properties of chemical systems where experimental values are not known.

Statistical assessment of modeling of proteins and ligands (SAMPL) challenges serve as an excellent platform to assess current and identify new computational modelling approaches for use in drug discovery. The SAMPL challenges are blind competitions focused on predicting a variety of properties, including pKa, binding affinities, and logP, in the absence of experimental data to ensure no bias in the consideration of computational models [1,2,3,4,5,6,7,8]. The SAMPL6 challenge for the prediction of physical properties was issued in two parts. The first part of the challenge entailed the prediction of aqueous microscopic and macroscopic pKa values for 24 small drug-like molecules. While a wide range of methods were utilized in the competition, including pure quantum mechanical (QM), molecular mechanical (MM), QM/MM hybrid approaches, and database-based algorithms, several quantum chemical approaches performed better than parameter-dependent methods [1, 9,10,11]. As well, machine learning approaches were utilized, as they are becoming increasingly popular for predictive models [1]. Part II of the SAMPL6 challenge introduces the prediction of water-octanol partition coefficients (logP) for 11 of the 24 molecules from Part I [12].

In this work, two different approaches are used to predict the logP as a part of the SAMPL6 challenge. To build upon knowledge gained from the SAMPL5 distribution-coefficient challenge, electronic structure calculations in combination with continuum solvation models are used to predict the logP and gauge for optimal approaches that provide a balance between efficiency and quality. Composite approaches are designed to target results possible with a higher level of theory, but at a fraction of the computational cost (memory, disk space, CPU time). The correlation consistent composite approach (ccCA) is an ab initio composite approach that has been utilized for main group thermochemistry and was initially proposed as an alternative to the Gaussian-n composite methods [13, 14]. Over the past several years, variants have been developed and have been used to target a number of different thermodynamic properties of main group molecules. For example, Solv-ccCA utilizes implicit solvent models to predict the pKas of amines and carbon acids within 1 kcal/mol with a direct thermodynamic scheme [15]. As well, DLPNO-ccCA combines the domain-based local pair natural orbital (DLPNO) methods with ccCA to enable ab initio composite methodologies to reach molecules of larger size than traditional composite methods, with a ~ 95–97% CPU time reduction relative to ccCA [16]. To account for the size of the SAMPL6 molecules and the calculation of partition coefficients between water and 1-octanol, solvation effects have been included within DLPNO-ccCA—denoted as DLPNO-Solv-ccCA—in this work.

As machine learning techniques are becoming increasingly popular and have demonstrated utility, a machine learning approach is also used here, interpolating a relationship between molecules with known logP values and structures similar to those of the challenge molecules. Machine learning approaches infer relationships between a set of known data and a property of interest and then use these relationships to predict the same property for new data where the property is not known [17]. As such, machine learning techniques require many data points of known properties for training; however, they are able to make predictions on new data faster than electronic structure methods. Machine learning techniques in chemistry-related fields often utilize quantitative structure–activity relationships (QSAR) that allow a chemical system to be correlated to a property of interest via a parameter or descriptor [18,19,20]. The range of possible QSAR descriptors is essentially unlimited; simple examples include the identities of atoms, bonds, and bond angles in a system, as well as charge, electronic, and thermodynamic properties, among others [21]. Machine learning algorithms are generally faster than algorithms for conventional electronic structure methods but require careful curation of the data used for training [22, 23].

Methods

For the logP challenge, the molecular structures were provided by the challenge organizers; however, the experimentally determined logP values were withheld so that all predictions would be uninfluenced by them. After completion of the challenge, the experimental results were released, and the results from the current study are compared to them. The eleven challenge molecules, a subset of the SAMPL6 pKa challenge molecules that resemble fragments of small protein kinase inhibitors, were used in their issued tautomer states to predict the octanol-water partition coefficient (logP). These molecules are shown in Fig. 1. The molecules are abbreviated as SM02, SM04, SM07, SM08, SM09, SM11, SM12, SM13, SM14, SM15, and SM16 in accordance with their naming scheme during the SAMPL6 pKa challenge.

Fig. 1 2D structures of the 11 challenge molecules in the SAMPL6 challenge

Machine learning and QSAR

A total of 97 molecules were selected from the NIST Standard Reference Database [24] of partition coefficients for use in the training models (Table 1). The molecules were chosen based on their structural similarity to the challenge set molecules, e.g. molecules with aromatic rings, heterocyclic rings, ketones, esters, ethers, and alcohols, as well as molecules that contain N, O, F, and Cl. The corresponding recommended logP value was also retrieved for all molecules selected from the NIST database. SMILES strings were retrieved from PubChem [25] and were transformed into 3-dimensional coordinates via Open Babel [26]. Structures were optimized using the B3LYP density functional including Grimme’s D3 dispersion correction with Becke-Johnson damping (B3LYP-D3) [27,28,29] and the correlation consistent polarized valence triple-ζ (cc-pVTZ) basis sets [30].

Table 1 Molecules used in the training data set

For the training model, both 2D and 3D QSAR descriptors were generated using the optimized structures. Initially, 19 molecular descriptors were generated for each molecule in the training and challenge sets (Table 2). The van der Waals volume (VDWV) and area (VDWA) were calculated using the fast calculation approach from Zhao et al. [31]. This approach requires only the number of bonds, the count of each atom, and the number of aromatic and nonaromatic rings present in each species, all of which are invariant to the 3D structure. Incorporating the geometric topology, quantum chemical descriptors, including HOMO/LUMO energies, dipole moments, superdelocalizabilities, and polarizabilities, were derived from electronic structure calculations using the PBE density functional approximation in both gas and solvated phases. Approaches utilizing these descriptors are labeled as ML-1 and ML-2. To achieve a better correlation, the list of descriptors was expanded to include additional 2D and 3D descriptors such as molar refractivity and surface accessibility (Table S1). All additional QSAR descriptors were generated using MOE 2018 and used for approaches ML-3 and ML-4 [32]. Once descriptors were generated, feature selection and exploratory data analysis were performed. Feature selection helps prevent overfitting in the model by trimming excess descriptors to remove irrelevant features. Exploratory data analysis (EDA), a visual way to illustrate data to search for trends, was utilized. Univariate analysis was used to determine how well each descriptor is correlated to the known, experimental logP values of the training set. The coefficient of determination, R², was calculated separately for each descriptor against the experimental logP values of the training set.
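The univariate screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: R² is taken to be the squared Pearson correlation of a single descriptor against logP, the function names `r_squared` and `rank_descriptors` are hypothetical, and the toy numbers are invented rather than taken from the training set.

```python
from math import fsum

def r_squared(x, y):
    """Squared Pearson correlation between one descriptor column and logP."""
    n = len(x)
    mx, my = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = fsum((a - mx) ** 2 for a in x)
    syy = fsum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no information
    return sxy * sxy / (sxx * syy)

def rank_descriptors(descriptors, logp):
    """Return (name, R^2) pairs sorted from most to least correlated."""
    scores = {name: r_squared(col, logp) for name, col in descriptors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data, not values from the study.
toy_logp = [0.5, 1.0, 1.5, 2.0, 2.5]
toy_descriptors = {
    "volume": [10.0, 20.0, 30.0, 40.0, 50.0],  # perfectly linear in logP
    "dipole": [3.0, 1.0, 4.0, 1.0, 5.0],       # only weakly related
}
ranking = rank_descriptors(toy_descriptors, toy_logp)
```

Descriptors at the bottom of such a ranking are candidates for removal during feature selection.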

Table 2 Geometric and quantum-chemical molecular descriptors used for machine learning

Electronic structure calculations for vertical solvation and descriptor generation

Correlation consistent polarized valence triple-ζ (cc-pVTZ) basis sets were used for all calculations [30]. The recommended cc-pV(T + d)Z basis set was used for all Cl atoms [33]. All density functional theory (DFT) calculations were performed with Gaussian16 [34]. All challenge molecules were optimized with the B3LYP density functional with Grimme’s D3 dispersion correction with Becke-Johnson damping (B3LYP-D3) [27,28,29]. The B3LYP-D3 functional was chosen since there are numerous conjugated ring structures that may exhibit intramolecular π-π stacking. Also, the use of Grimme’s dispersion correction can alter ligand orientation for systems exhibiting long-range noncovalent interactions. All structures were verified to be local minima via frequency calculations on an ‘ultrafine’ integration grid. Harmonic frequencies and vibrational contributions to the Gibbs free energy were scaled by 0.989 in accordance with previous studies [14]. More sophisticated integration grids were used to optimize the structures of molecules displaying methyl rotations. For SM09 and SM12, HCl counterions were included.

The logP was calculated from single point calculations using six functionals: BLYP [27, 35], B3LYP [27, 28], PBE [36], PBE0 [36, 37], PW91 [38, 39], and B3PW91 [28, 38, 39]. The functionals BLYP, PBE, and PW91 are generalized gradient approximation (GGA) functionals and B3LYP, PBE0, and B3PW91 are hybrid-GGA functionals through the inclusion of exact exchange within the functional. All single point calculations use the Solvent Model for Density, or SMD, implicit solvent model to account for long-range solvent effects of water and 1-octanol on the solute [40].

Given the size of the challenge molecules and previous studies indicating that the percent of exact exchange within a functional correlates with logP predictions in the SAMPL5 competition [7], the domain-based local pair natural orbital (DLPNO) CCSD(T) method within the ORCA 4.0 program suite was utilized to predict logP [41,42,43]. For the DLPNO calculations, the coulomb-exchange fitting correlation consistent auxiliary basis set (cc-pVTZ/JK) was used in conjunction with the RIJCOSX approximation [44]. The RIJCOSX approximation is a resolution-of-the-identity (RI) approximation that calculates Coulomb integrals and uses a semi-numerical integration technique to calculate the exchange integrals. Within ORCA, the TightPNO setting was utilized to reduce the number of screened pair natural orbitals (PNOs) from the DLPNO-CCSD(T) calculation. Solvation effects were modeled using the SMD implicit solvent model for water and 1-octanol within ORCA 4.0.

To further expand on the application of the DLPNO methods, a combination of the DLPNO-ccCA [16] and Solv-ccCA [15] ab initio composite strategies, denoted as DLPNO-Solv-ccCA, was used. Outlined in Table 3, the DLPNO-Solv-ccCA approach utilizes the DLPNO methods and the SMD solvation model within the ccCA approach to calculate the free energy of solvation, which can be applied to solve for the logP following Eq. 1. The ccCA methodology has been extensively described in literature [13, 14, 45, 46].

Table 3 A schematic for DLPNO-Solv-ccCA

A brief description of the ccCA methodology includes using Dunning’s augmented correlation consistent basis sets (aug-cc-pVnZ, n = D, T, Q) to extrapolate to the complete basis set (CBS) limit for the RI-HF and DLPNO-MP2 separately. Various extrapolation schemes have been developed to extrapolate electronic energies to the CBS limit, including a three-point extrapolation scheme based on the ζ-level of the basis set and a two-point extrapolation incorporating the maximum angular momentum of the basis set. For this work, an average of Peterson’s three-point extrapolation and the Schwartz two-point extrapolation using triple- and quadruple-zeta level basis sets was used in accordance with previous ccCA studies [14, 16, 47, 48]. Additive corrections are included to account for core-core correlation (ΔCC), core-valence correlation (ΔCV), and scalar relativity (ΔDK) using the DLPNO methods.
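The averaging of the two extrapolations can be illustrated with a short Python sketch. The functional forms below are assumptions, since the text does not write them out: the three-point scheme is taken as the mixed exponential/Gaussian form E(n) = E_CBS + B·exp[−(n−1)] + C·exp[−(n−1)²] with n = 2, 3, 4, and the two-point scheme as an inverse power of the maximum angular momentum, E(l) = E_CBS + B/l^p with l = 3 and 4. The exponent `power=3` and all function names are hypothetical choices for illustration.

```python
import math

def peterson_cbs(e_d, e_t, e_q):
    """Three-point fit of E(n) = E_cbs + B*exp(-(n-1)) + C*exp(-(n-1)**2), n = 2, 3, 4."""
    a2, a3, a4 = (math.exp(-(n - 1)) for n in (2, 3, 4))
    b2, b3, b4 = (math.exp(-((n - 1) ** 2)) for n in (2, 3, 4))
    # Eliminate E_cbs by differencing, then solve the 2x2 system for B and C.
    p1, q1, r1 = a2 - a3, b2 - b3, e_d - e_t
    p2, q2, r2 = a3 - a4, b3 - b4, e_t - e_q
    det = p1 * q2 - p2 * q1
    b = (r1 * q2 - r2 * q1) / det
    c = (p1 * r2 - p2 * r1) / det
    return e_q - b * a4 - c * b4

def schwartz_cbs(e_t, e_q, power=3):
    """Two-point fit of E(l) = E_cbs + B / l**power with l_max = 3 (TZ) and 4 (QZ)."""
    w_t, w_q = 3.0 ** power, 4.0 ** power
    return (w_q * e_q - w_t * e_t) / (w_q - w_t)

def averaged_cbs(e_d, e_t, e_q, power=3):
    """Average of the three-point and two-point CBS estimates."""
    return 0.5 * (peterson_cbs(e_d, e_t, e_q) + schwartz_cbs(e_t, e_q, power))
```

Both functions recover the limiting energy exactly when the input energies follow the assumed form, which makes them easy to check against synthetic data before use on real energies.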

To predict the logP values with ab initio approaches, the logP was calculated with a vertical solvation method following Eq. 1 below

$$ \log P = \log\left( \frac{[\text{solute}]_{\text{octanol}}}{[\text{solute}]_{\text{water}}} \right) = \frac{\left( \Delta G_{\text{water}} - \Delta G_{\text{octanol}} \right) \log_{10} e}{kT} $$
(1)

where k is Boltzmann’s constant, T is temperature, e is Euler’s number and ΔG is the free energy of solvation in water and octanol.
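As a minimal numerical sketch of Eq. 1, the following Python function converts a pair of solvation free energies into a logP value. It assumes the free energies are molar quantities in kcal/mol, so the gas constant R takes the place of k, and it assumes a temperature of 298.15 K; the function name and the example energies are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K); replaces k for molar energies

def logp_from_solvation(dg_water, dg_octanol, temperature=298.15):
    """Eq. 1: logP = (dG_water - dG_octanol) * log10(e) / (R * T)."""
    return (dg_water - dg_octanol) * math.log10(math.e) / (R_KCAL * temperature)

# Hypothetical solvation free energies in kcal/mol, not values from the study.
example_logp = logp_from_solvation(dg_water=-5.0, dg_octanol=-7.0)
```

Under these assumptions one logP unit corresponds to RT ln 10, roughly 1.36 kcal/mol at 298.15 K, so a solute that is 2 kcal/mol better solvated in octanol than in water has a logP of about 1.47.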

Results

All methods presented correspond to 12 SAMPL6 submissions (IDs: rs4ns, hsotx, c7t5j, 5t0yn, jc68f, fe8ws, 7gg6s, f3dpg, arw58, 4p2ph, cr3hs, and kxsp3); however, the results presented here differ from the predictions submitted to the challenge (see Table S2). For five of the SAMPL6 challenge molecules containing the 4-aminoquinazoline backbone, inconsistencies were identified between the target structure and the calculated structure. These inconsistencies originated from the initial guess of the molecular structure obtained by converting SMILES strings to 3D coordinates and resulted in poorer predictions (averaging 0.4 and 1.5 logP units for ML and QM approaches, respectively). Re-optimization from a better initial guess was required to obtain the target structure (see Fig. S1). The structures were corrected post-submission, and the revised results are presented below.

Machine learning models

Each of the 97 molecules in the initial training set had 19 molecular descriptors that were used to determine the response variable, logP. Using a univariate analysis to correlate each descriptor to the experimental logP values, the number of useful descriptors was reduced. Although most of the molecular descriptors had little to no correlation to the logP, some of the results followed trends previously identified by Reddy et al. [49]; specifically, the van der Waals volume and dipole moments were highly correlated to the logP. As shown in Table 4, descriptors such as the superdelocalizability of the HOMO (SHOMO) and nucleophilic superdelocalizability, which were expected to correlate well with logP predictions, were poorly correlated in this training set. Polarizability, a descriptor not shown in Reddy’s analysis, had the highest correlation overall, with an R² of 0.44. Additionally, molecular weight had a comparatively high correlation to logP (R² = 0.36); however, this descriptor was also highly correlated with the VDWV and VDWA descriptors. As molecular properties calculated in water and octanol, including HOMO, LUMO, and dipole moment, were very highly correlated to each other, with R² values above 0.98, only descriptors in the water phase were considered. By removing the extraneous descriptors that have little correlation to logP, overfitting in the machine learning model was minimized.

Table 4 Correlation between the molecular descriptors to known logP values for the training set

For the first approach, denoted ML-1, a multilinear regression (MLR) model was constructed using the three descriptors that best correlated with logP in the training set: VDWV, polarizability, and the aqueous PBE dipole moment of the molecule. This approach captured roughly 65% of the variance in the training set and had a root mean squared error (RMSE) of 0.62. With five-fold cross-validation (CV), this approach yielded an RMSECV of 0.76. As illustrated in Fig. 2, the descriptors with the highest correlation to known logP values for the training set molecules have a linear association; however, for the challenge molecules, there is less overlap, indicating the need to modify the descriptors or the training data for a better fit.
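A three-descriptor MLR fit of this kind can be sketched with the standard library alone. The sketch below solves the least-squares normal equations directly, which is an assumption about the fitting procedure rather than the implementation used in this work; the helper names and the toy descriptor values are hypothetical.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_mlr(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear(xtx, xty)

def predict_mlr(coef, rows):
    """Apply the fitted coefficients to new descriptor rows."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]

def rmse(pred, obs):
    """Root mean squared error between predictions and observations."""
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)) ** 0.5

# Hypothetical, rescaled toy descriptors (volume, dipole, polarizability).
toy_rows = [(1.0, 1.0, 1.0), (1.2, 2.5, 1.1), (1.5, 0.5, 1.6),
            (0.9, 3.0, 0.8), (2.0, 1.5, 1.9), (1.7, 2.0, 1.8)]
toy_y = [0.5 + 2.0 * v - 0.3 * d + 1.0 * p for v, d, p in toy_rows]
toy_coef = fit_mlr(toy_rows, toy_y)
toy_rmse = rmse(predict_mlr(toy_coef, toy_rows), toy_y)
```

The toy targets are generated from a known linear relationship, so the fit should return the generating coefficients and a training RMSE close to zero; real descriptor data would of course leave a residual.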

Fig. 2 The three descriptors selected for the ML-1 model (VDWV, dipole moment, and polarizability). The training set (blue) occupies a slightly different region of the plot than the challenge molecules (red), which can potentially lead to prediction errors due to poor model fit

In the second approach, denoted ML-2, a more robust model was constructed. Using the predictions from the MLR model, molecules determined to be outliers were removed from the training set. An outlier is defined as any molecule whose error in the predicted logP deviates by three or more standard deviations of the prediction error computed over the entire training set. Based on this criterion, seven outliers were removed, leaving 90 molecules in the training set. A partial-least-squares (PLS) regression model was used on the curated training data; PLS is a useful technique that can reveal intercorrelation among the descriptors, along with structural similarities of the molecules in the training set, that aid in optimizing the model [50]. To reduce the dimensionality of the variables under consideration in this model, two latent variables, which group variables together to represent a commonality, were used. The two-component PLS model explained 72% of the variance in the training set, with an RMSE of 0.55 and an RMSECV of 0.65 via five-fold cross-validation, which was an improvement over the MLR model used in ML-1.
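The three-standard-deviation outlier rule can be sketched as follows. Two details are assumptions that the text leaves open: the deviation is measured from the mean prediction error, and the population standard deviation is used. The function name `split_outliers` and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def split_outliers(names, predicted, observed, n_sigma=3.0):
    """Separate molecules whose prediction error deviates by >= n_sigma standard deviations."""
    errors = [p - o for p, o in zip(predicted, observed)]
    center, sigma = mean(errors), pstdev(errors)
    if sigma == 0.0:
        return list(names), []  # all errors identical: nothing stands out
    kept, removed = [], []
    for name, err in zip(names, errors):
        (removed if abs(err - center) >= n_sigma * sigma else kept).append(name)
    return kept, removed

# Hypothetical toy set: 19 well-predicted molecules and one gross error.
toy_names = [f"mol{i}" for i in range(20)]
toy_observed = [1.0] * 20
toy_predicted = [1.1 if i % 2 == 0 else 0.9 for i in range(19)] + [4.0]
toy_kept, toy_removed = split_outliers(toy_names, toy_predicted, toy_observed)
```

Note that with a small data set a single point can never exceed three population standard deviations, so a rule of this kind only becomes meaningful for training sets of roughly a dozen molecules or more.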

To expand on ML-1 and ML-2, the list of descriptors considered was expanded by 55 additional 2D and 3D descriptors for the 97-molecule training set. Within the expanded set of molecular descriptors, several descriptors highly correlate (R² ≈ 0.8) with other descriptors (Fig. 3). For example, the van der Waals descriptors for area and volume (vdw_area and vdw_vol) correspond to some of the topological indices based on graph theory, including Wiener (wienerPath, wienerPol) and Zagreb indices. QSAR descriptors for solubility (logS) and molar refractivity (h_mr) highly intercorrelate with descriptors for the partition and distribution coefficients (h_logD and h_logP, respectively), and moderately correlate (R² ≈ 0.5) with various descriptors for the water accessible surface area (ASA, ASA+, and ASA-). Using an approach similar to that used for ML-1, the full list of descriptors was reduced by selecting descriptors that had a high correlation with logP within the training set and were disparate from one another. In the third approach, ML-3, eight descriptors were selected: apol, ASA_H, a_hyd, GCUT_SMR_3, mr, vdw_vol, vsa_hyd, and polarizability. A PLS model with three latent variables was derived, capturing roughly 77% of the variance in the training set and yielding an RMSE of 0.50 and an RMSECV of 0.69.

Fig. 3 Heat map representing the pairwise descriptor correlation of the 74 molecular descriptors

In the fourth approach, ML-4, a total of 74 descriptors were used. From this extended list of descriptors, a PLS model with three components was derived to handle collinearity among the variables and reduce overfitting; it captured roughly 85% of the variance in the training set and yielded an RMSE of 0.41 and an RMSECV of 0.56. By contrasting the reduced variables, similarities among the molecules in the training set and test set can be identified. As shown in Fig. 4, the challenge molecules are quite dissimilar to the training molecules. The PLS method was able to successfully identify the collinearity among the descriptors and reduce the overall model complexity.

Fig. 4 First three PCA components of the descriptors in the ML-4 model obtained for the training (blue) and challenge (red) molecule data sets

The logP values for the eleven challenge molecules were predicted using the four machine learning models, which were constructed using different algorithms and molecular descriptors (Table 5). Applied to the challenge molecules, the ML-1 MLR model overestimated the logP, with a mean absolute error (MAE) of 1.07 logP units. Using the RMSECV as the model uncertainty, only five of the challenge molecules were predicted within the range expected from the model calibration. The ML-2 approach, which used the PLS regression model and the curated training set, performed better than ML-1, with an MAE of 0.97 logP units; however, ML-2 overestimated the logP in all cases except SM15 and predicted only four molecules within the model uncertainty. With the inclusion of eight descriptors, the ML-3 model underestimated the logP for the challenge molecules, excluding SM13 and SM16, yielding an MAE of 1.01 logP units with respect to experiment. This approach was able to accurately predict the logP of SM13. Although the MAE increased relative to ML-2, the ML-3 model determined the logP within the expected uncertainty for six of the challenge molecules. The best approach, ML-4, had an MAE and RMSE of 0.69 and 0.87 logP units, respectively. This model generally underestimated the logP, by an average of 0.68 logP units with respect to experiment, and predicted the logP values of only five challenge molecules within the model uncertainty. Regarding individual challenge molecules, SM12, SM15, and SM16 were each predicted within the model uncertainties by three or more approaches, whereas SM07 and SM11 fell outside the model uncertainty for all approaches.
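Counting how many predictions fall within the model uncertainty can be expressed compactly. The sketch below assumes that "within the model uncertainty" means the experimental value lies inside the predicted value plus or minus the RMSECV; the function name and the molecule labels and values are hypothetical placeholders, not entries from Table 5.

```python
def within_uncertainty(predicted, experimental, uncertainty):
    """Names whose experimental logP lies inside predicted +/- uncertainty."""
    return [name for name, pred in predicted.items()
            if abs(pred - experimental[name]) <= uncertainty]

# Hypothetical values for three unnamed molecules.
toy_pred = {"A": 2.0, "B": 3.5, "C": 1.0}
toy_exp = {"A": 2.5, "B": 2.0, "C": 1.2}
toy_hits = within_uncertainty(toy_pred, toy_exp, uncertainty=0.76)
```

Here the uncertainty of 0.76 mirrors the RMSECV reported for ML-1, and two of the three toy molecules fall inside the window.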

Table 5 Predicted logP values of all machine learning approaches

Electronic structure calculations

Electronic structure methods, although useful in providing a framework for quantum mechanical-based machine learning approaches, have the capacity alone to predict physical properties in the absence of experiment without the need to train large amounts of data focused around targeted applications.

To evaluate the utility of quantum chemical approaches, density-based and wavefunction-based methods were used to predict the logP for the challenge molecules. For DFT approximations, GGA and hybrid-GGA functionals were employed to consider the role of parameterization and exact exchange within density functionals. As shown in Table 6, all six density functional methods considered underestimate the logP. Overall, BLYP was the best functional, with an MAE of 0.77 logP units and a maximum deviation of 1.51 logP units. Among the GGA functionals, the PBE and PW91 functionals performed similarly, having MAEs of 0.80 and 0.81 logP units, respectively. Comparing the hybrid-GGA functionals, B3LYP predicts the logP better than B3PW91 and PBE0. Regarding the contribution of exact exchange, adding exact exchange to the respective GGA functional results in further underestimation of the logP, ranging from 0.05 to 0.27 logP units.

Table 6 Predicted logP values with various density functionals with the cc-pVTZ basis set

Taking a more aggressive approach, linear scaling wavefunction based methods were used to examine the efficacy of ab initio methodologies for predicting the logP and expand the applications of ab initio methodologies towards molecules of increasing size and complexity. Initially, DLPNO-CCSD(T) was used with a triple-ζ level basis set to predict the transfer free energies to estimate the logP (Table 7). This method underestimated the logP for most of the challenge molecules, excluding SM09 and SM14, resulting in an MAE of 0.50 logP units.

Table 7 Predicted logP values determined with DLPNO-CCSD(T) and DLPNO-Solv-ccCA

While CCSD(T) is known for predicting energetic properties of small organic molecules typically within 1.0 kcal/mol (approximately 0.73 logP units at 298 K), composite methods, in particular ccCA, have been shown to reach similar accuracies, and in some cases achieve even better results than CCSD(T)/complete basis set limit calculations, but at reduced computational cost [51]. At the time of the challenge, as the experimental results were unavailable, the DLPNO-Solv-ccCA composite strategy was used with the intent to predict logP values with a targeted accuracy equivalent to that of DLPNO-CCSD(T)/aug-cc-pCV∞Z-DK. Similar to the results obtained by DLPNO-CCSD(T), the composite method underestimated the logP for most of the challenge molecules (MAE = 0.51 logP units), with the exception of SM09 and SM14, for which the composite approach was 0.03 logP units closer to the experimental values than the DLPNO-CCSD(T) method.

Discussion

Machine learning

Machine learning is popular in modern-day computing and has the potential to be very useful for understanding complex chemical phenomena; however, its learning ability is limited by the quality and quantity of the data. For the purposes of this challenge, the logP reference values in the NIST Chemistry WebBook were used to establish a relationship between the structures of molecules similar to the challenge set and their logP values, and the relationship was then transferred to the challenge molecules. Typically, the training data set should be very similar to the set to which the model would be applied. Unfortunately, this was not true for this investigation because the challenge molecules were larger than the molecules available from the database and contained more complex heterocycles (see Tables S3 and S4).

For the generation of quantum chemical molecular descriptors, choosing an inexpensive yet useful electronic structure method can be challenging. The functionals employed in this study were intentionally selected in pairs from the same developer family, i.e. PBE and PBE0, BLYP and B3LYP, and PW91 and B3PW91, to maintain consistency among the parameters within each pair. Additionally, these functionals were chosen due to their ability to predict HOMO-LUMO gaps, one of the quantum chemical molecular descriptors, for a set of small main group molecules [52]. Transferring this concept to machine learning allows for greater transparency as to which molecular descriptor is more correlated with the desired property, because empirical parameters from significantly different functional types will not bias the outcome. Building upon this idea, PBE and PBE0 were chosen as the methods for determining quantum chemical descriptors for the machine learning models because no empirical parameters were used in developing these functionals. This allows the machine learning model to better gauge trends between different molecular descriptors.

The predicted results for the challenge molecules appeared to improve as descriptors were added, illustrating that the descriptors were accounting for molecular properties that correlated to logP, and that the training set molecules were sufficiently structurally relevant to the challenge set molecules to permit the machine learning algorithms to identify useful patterns. As shown in Fig. 5, for the ML-1 model, the predictions for the challenge molecules were biased toward more hydrophobic values with respect to experiment. Using the same molecular descriptors and removing outlier molecules from the training set did not result in a significant improvement for the challenge molecules. In contrast, while using additional descriptors improved the correlation between predicted and experimental logP values for the training set, selecting only a small subset of descriptors, as done in the ML-3 model, shifted the predictions of the challenge molecules to be more hydrophilic. Using many molecular descriptors adds to the flexibility of the model, and the best predictions were obtained using the ML-4 model. As shown in Fig. 6, several descriptors used in ML-4 contribute to the improvement of the model, including logP(o/w), h_logD, h_logP, h_logS, logS, a_hyd, and vsa_hyd, as these descriptors are highly correlated to the experimental logP for the training set of molecules.

Fig. 5 Comparison of the four machine learning models

Fig. 6 Correlation between the molecular descriptors used in the ML-4 model to logP values for the training set

Several common machine learning techniques were used for the challenge, including principal component regression (PCR), partial least squares (PLS), and support vector machines (SVM) [50, 53,54,55]. Both PCR and PLS are linear methods that group descriptors into orthogonal components and reduce overfitting by removing excess components. SVM is an alternative, generally non-linear method that draws boundaries to best fit data. To help prevent overfitting, five-fold cross-validation was used to validate the models. In this cross-validation approach, the data are randomly split into five roughly equal subsets, and then 80% of the data are iteratively selected and used to build a model that predicts the remaining 20%, giving an estimate of how the model would perform on data not used for training [21]. Additionally, the leave-one-out (LOO) method for cross-validation was considered to detect outliers, as it is a more extensive technique than five-fold cross-validation. In this cross-validation approach, one molecule of the full training set is selected for testing, whereas the remaining molecules are used for building the model, and the logP and RMSE are calculated per iteration. Outliers were identified by logP prediction errors greater than 3 times the standard deviation of the error. The LOO method identified the seven molecules previously determined as outliers via five-fold cross-validation and two additional molecules. Results for all methodologies were fairly similar, which implies that all models were converging to a constant value.
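The five-fold and leave-one-out procedures differ only in the number of folds, which the following Python sketch makes explicit. It is a generic illustration: the RMSECV is assumed to pool the squared errors of all held-out predictions, the one-descriptor line fit stands in for the MLR and PLS models used in this work, and all function names and data are hypothetical.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rmse_cv(x, y, fit, predict, k=5, seed=0):
    """Cross-validated RMSE: each fold is predicted by a model trained on the others."""
    sq_err = 0.0
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        preds = predict(model, [x[i] for i in held_out])
        sq_err += sum((p - y[i]) ** 2 for p, i in zip(preds, held_out))
    return (sq_err / len(y)) ** 0.5

def fit_line(xs, ys):
    """Stand-in model: ordinary least-squares line through one descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return slope, my - slope * mx

def predict_line(model, xs):
    slope, intercept = model
    return [slope * a + intercept for a in xs]

# Hypothetical toy data that follow an exact line.
toy_x = [float(i) for i in range(10)]
toy_y = [2.0 * v + 1.0 for v in toy_x]
five_fold = rmse_cv(toy_x, toy_y, fit_line, predict_line, k=5)
leave_one_out = rmse_cv(toy_x, toy_y, fit_line, predict_line, k=len(toy_y))
```

Setting `k` equal to the number of molecules turns the same routine into leave-one-out validation, at the cost of fitting one model per molecule.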

Comparing the relative computational cost of the quantum chemical methods used, the DFT methods scale approximately as N³ to N⁴, depending on the complexity of the functional, whereas the linear scaling wavefunction-based methods scale as N⁵, in which N refers to the system’s size (the number of basis functions). In the traditional ccCA method, determining the higher-order electron correlation contributions is the most time-consuming step, scaling as N⁷; however, DLPNO-Solv-ccCA scales as N⁵ because of the reduced number of orbital triples, which arises from having three electron pairs survive the electron pair screening process, included in the DLPNO-CCSD(T) calculations [56, 57]. Given the size of the molecules in both the training and challenge sets, it is possible to expand the quantity and quality of quantum chemical descriptors to improve the predictive ability of the machine learning models. As the optimization of the molecular structure is necessary to generate 3D descriptors, the protocol could be augmented by employing semiempirical or more approximate methods to obtain the initial guess of the molecular structure, allowing larger data sets to be used in the training of machine learning models.

Observing trends among the four machine learning models against the challenge set, adjustments to each model resulted in a significant improvement for most challenge molecules. SM11 was a notable outlier and was poorly predicted by each model. In contrast to the other challenge molecules, its predicted van der Waals volume as well as other descriptors were separated from the median, and these are believed to be the source of error in the machine learning predictions. Furthermore, six of the challenge molecules (SM02, SM04, SM07, SM09, SM12, SM13) represent the 4-amino quinazoline scaffold present in protein kinase inhibitors used as anticancer agents [50] and primarily differ by the substituents on the 4-amino group. Ranking the logP of these 4-amino quinazoline derivatives, the four machine learning models were able to properly rank some of the molecules (Table 8). The ML-1 and ML-2 models properly ranked SM09 and were close (within one rank position) to properly ranking SM12, SM04, and SM02. The ranking of the six molecules is in fact identical for both methods. As the only difference between the two methods is the algorithm used, this highlights that the descriptors carry great weight in determining these substituent effects. The ML-3 model correctly ranked SM07 and SM02 and was close to properly ranking SM09 and SM04. Oddly, the best model with the lowest MAE, ML-4, failed to correctly rank any of the six derivatives but was able to predict SM13 and SM02 within one rank position. These trends reveal that some descriptors, or possibly the use of many descriptors, increase the overall resolution for machine learning but decrease the resolution for contrasting substituent effects.
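The rank comparison described above can be sketched as follows. This is a minimal illustration rather than the analysis script used in this work: the function names are hypothetical, the molecule labels A to F and their logP values are placeholders, not values from Table 8, and ties are not treated specially.

```python
def ranks(values):
    """Map each molecule to its 1-based position, smallest logP first."""
    ordered = sorted(values, key=values.get)
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_agreement(predicted, reference):
    """Return molecules ranked exactly right and those off by one position."""
    pred_rank, ref_rank = ranks(predicted), ranks(reference)
    exact = sorted(m for m in ref_rank if pred_rank[m] == ref_rank[m])
    near = sorted(m for m in ref_rank if abs(pred_rank[m] - ref_rank[m]) == 1)
    return exact, near


# Placeholder values for six hypothetical molecules (not data from this study).
reference = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 5.0, "F": 6.0}
predicted = {"A": 1.2, "B": 3.5, "C": 2.9, "D": 4.1, "E": 6.5, "F": 3.0}

exact, near = rank_agreement(predicted, reference)
print("correctly ranked:", exact)       # ['A']
print("within one position:", near)     # ['C', 'D', 'E']
```

A model can therefore have a low MAE and still order a congeneric series poorly, since small absolute errors are enough to swap neighbours whose logP values are close.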

Table 8 Ranking the logP predictions in order from smallest to largest logP for the molecules with the 4-amino quinazoline scaffold

Electronic structure

All density functional approaches underestimated the logP values [as indicated by the negative sign of the mean signed error (MSE)], so the MSE and MAE are equal in magnitude. Functional choice was a factor in logP prediction. The GGA functionals (BLYP, PBE, and PW91) yielded a lower MSE, MAE, and RMSE than the hybrid-GGA functionals (B3LYP, PBE0, and B3PW91) by approximately 0.10 to 0.13 logP units. This implies that the inclusion of exact exchange does not benefit the energetics for the prediction of water-octanol partition coefficients. The BLYP functional yielded the highest logP values, which correspond to the lowest MAE and RMSE among the examined functionals. Across all functionals, SM07 yielded the largest error in logP, approximately 1.61 logP units, and SM09 yielded the smallest error, approximately 0.09 logP units.
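The three error statistics used in this comparison can be written out explicitly. The sketch below assumes the usual definitions (error = predicted minus experimental) and uses made-up numbers, not results from this study, to show why a method that underestimates every molecule has an MSE and MAE of equal magnitude.

```python
import math


def error_metrics(predicted, experimental):
    """Return (MSE, MAE, RMSE), where MSE is the mean signed error."""
    errors = [p - e for p, e in zip(predicted, experimental)]
    n = len(errors)
    mse = sum(errors) / n
    mae = sum(abs(err) for err in errors) / n
    rmse = math.sqrt(sum(err * err for err in errors) / n)
    return mse, mae, rmse


# Illustrative values only: every prediction lies below its reference value.
experimental = [2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.75]

mse, mae, rmse = error_metrics(predicted, experimental)
print(f"MSE = {mse:.3f}, MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

Because every signed error is negative here, the MSE is exactly the negative of the MAE; any difference in magnitude between the two would indicate that some molecules were overestimated.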

In considering more rigorous methods known to predict properties within 1.0 kcal/mol of experiment for small organic molecules, a question is whether or not the increase in computational expense makes an important impact on the prediction of partition coefficients of drug-like molecules. For this SAMPL challenge, there is a useful impact. The DLPNO-CCSD(T) and DLPNO-Solv-ccCA methods were among the best performing methods examined in this study. Overall, both methods resulted in underestimations of logP, except for SM09 and SM14. In general, DLPNO-Solv-ccCA yielded lower logP values than DLPNO-CCSD(T) by less than 0.06 logP units for all molecules with the exception of SM08, for which DLPNO-Solv-ccCA increased the predicted logP by 0.02 logP units. DLPNO-Solv-ccCA includes DLPNO-MP2 calculations, which contributed to the lowering of the predicted logP relative to DLPNO-CCSD(T).
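To relate the 1.0 kcal/mol energy target to logP units, the sketch below uses the standard thermodynamic relation logP = (ΔG_solv,water − ΔG_solv,octanol) / (RT ln 10). The sign convention, the temperature of 298.15 K, and the function names are assumptions made for this illustration and should be checked against Eq. 1; the input energies are placeholders, not computed values from this work.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol K)


def logp_from_solvation(dg_water, dg_octanol, temperature=298.15):
    """logP from solvation free energies (kcal/mol) in water and octanol."""
    return (dg_water - dg_octanol) / (R_KCAL * temperature * math.log(10))


def logp_shift(energy_error, temperature=298.15):
    """Change in logP caused by an error (kcal/mol) in the transfer free energy."""
    return energy_error / (R_KCAL * temperature * math.log(10))


# Placeholder solvation free energies in kcal/mol (not values from this study).
print(round(logp_from_solvation(-10.0, -13.0), 2))
# An error of 1.0 kcal/mol corresponds to roughly 0.73 logP units at 298.15 K.
print(round(logp_shift(1.0), 2))
```

Under these assumptions, differences of a few hundredths of a logP unit between methods correspond to energy differences well below 0.1 kcal/mol in the transfer free energy.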

For DLPNO-Solv-ccCA, the most significant contribution to the predicted logP value was the core-core correlation contribution (ΔCC) (Table 9). In comparing the DLPNO-CCSD(T)/cc-pVTZ logP values to those obtained with DLPNO-MP2/CBS + ΔCC, which effectively models DLPNO-CCSD(T)/CBS, the additional contribution reduced the predicted logP values by 0.02 to 0.08 logP units, with the exception of SM15, for which the logP value increased by 0.01 logP units. The largest difference in magnitude between the DLPNO-CCSD(T)/cc-pVTZ and DLPNO-MP2/CBS + ΔCC logP values is for SM09 and SM13, at 0.08 logP units. This implies that including a larger basis set would decrease the predicted logP value, and that this effect can be captured without having to calculate logP values at the DLPNO-CCSD(T)/aug-cc-pVQZ level.

Table 9 Decomposition of DLPNO-Solv-ccCA predicted logP values

The core-valence contribution (ΔCV) was the next largest contribution, as the logP values changed by less than 0.05 logP units. This is most likely due to the small number of additional core electrons included explicitly in the calculation, with only the 1s shell for first-row atoms and the 2s and 2p shells for chlorine atoms. The largest change from including the core-valence contribution in the DLPNO-MP2/CBS + ΔCC logP value was an increase of 0.05 logP units for SM09. In addition, the predicted logP value decreased when including the core-valence contribution for SM07, SM14, and SM15, by 0.01, 0.01, and 0.02 logP units, respectively. This shows that there is no general trend with the inclusion of the sub-valence electrons in the logP calculation. For relativistic corrections, the predicted logP values were altered by no more than 0.01 logP units relative to logP values generated with DLPNO-MP2/CBS + ΔCC + ΔCV energies. Since logP is a relative quantity, as shown in Eq. 1, this essentially negates the effect of relativity in the solvent phase. Overall, the core-valence and relativistic contributions obtained at the DLPNO-MP2 level resulted in a small change (0.05 logP units) in the predicted logP values relative to DLPNO-CCSD(T)/cc-pVTZ.
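Because the composite energy is built from additive terms and logP is linear in those energies, a decomposition of the kind discussed above can be tabulated as a running total. The sketch below is only an illustration of that bookkeeping: the function name, the term labels, and every numerical value are placeholders, not entries from Table 9.

```python
def cumulative_logp(base_logp, corrections):
    """Return (label, running logP) pairs after each additive correction."""
    running = base_logp
    steps = [("DLPNO-MP2/CBS", running)]
    for label, delta in corrections:
        running += delta
        steps.append((f"+ {label}", running))
    return steps


# Placeholder increments in logP units (hypothetical, not values from Table 9).
corrections = [("dCC", -0.50), ("dCV", 0.05), ("dSR", -0.01)]

steps = cumulative_logp(3.00, corrections)
for label, value in steps:
    print(f"{label:<16s} {value:6.2f}")
```

Laying the terms out this way makes it easy to see which correction dominates and whether the smaller ones change sign from molecule to molecule.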

Revisiting the 4-amino quinazoline derivatives, the DLPNO-CCSD(T) method did not correctly rank any of the six molecules but was close to correctly ranking four of the six (Table 8). With the core-core correlation correction (ΔCC) included in the composite approach, the trends are nearly identical to the DLPNO-CCSD(T) predictions, with the exception of SM04, for which DLPNO-Solv-ccCA predicts the correct rank. Notable outliers, for which the electronic structure methods had the most difficulty in determining logP values, are SM02 and SM07. The error in logP for both molecules was larger than for the other molecules within the challenge set. The structure of SM02 differs from that of the other molecules in that it has a trifluoromethyl phenyl group attached to the 4-amino quinazoline. As modeled, the challenge molecules are generally planar. Considering various low-energy conformations may result in improvements to the logP predictions. The treatment of electrostatics in solution also has an impact on the prediction. Including the HCl counterion was significant in the logP prediction for both SM09 and SM12. For example, removing the counterion for SM09 decreases the predicted value by 1.4 logP units using the DLPNO-CCSD(T) method. Incorporating the ionic strength of the solution may be a useful consideration for future improvement.

Conclusion

For the electronic structure methods, the GGA functionals (BLYP, PBE, and PW91) yielded lower MAEs than the hybrid-GGA functionals (B3LYP, PBE0, and B3PW91), with BLYP yielding the lowest MAE among all utilized density functionals. DLPNO-CCSD(T) and DLPNO-Solv-ccCA yielded MAEs of 0.50 and 0.51 logP units, respectively, the lowest among all methods examined in this work. The core-valence and scalar relativistic corrections altered the DLPNO-Solv-ccCA logP values by less than 0.05 logP units.

The machine learning approaches tended to yield favorable results, with an overall MAE of approximately 1 logP unit, but were sensitive to the choice of data used in the training set and to the choice of descriptors. This is generally one of the drawbacks of machine learning algorithms, as there are often no clear systematic ways to improve the accuracy beyond adding more data to the model, and such data may not be readily available. Overall, the machine learning approaches showed potential for further development toward logP prediction with a relatively small training set.