Introduction

Lipophilicity is the most important physicochemical property in drug discovery and a key design parameter in medicinal chemistry [1–4]. Lipophilicity has traditionally been linked [5–8] to permeability although it has long been recognized that high lipophilicity is associated with poor aqueous solubility [9] and is an undesirable feature in compounds intended to be drugs [10]. Lipophilicity considerations feature prominently in the well-known ‘Rule of 5’ (Ro5) [11] which is essentially a statement of physicochemical property distributions for compounds that had been taken into Phase II clinical studies at some point before the publication of the original study. Although invoked frequently, and occasionally outside its applicability domain of oral absorption, Ro5 provides no guidance as to how compliant compounds should be optimized. It is also unclear why the high polarity limit for Ro5 is specified in terms of hydrogen bonding while the low polarity limit is defined by lipophilicity. The wide acceptance of Ro5, and the popularity of approaches to data presentation that hide or mask variation, tend to blind drug discovery scientists to the possibility that lipophilicity may be less predictive of outcomes, such as pharmacological promiscuity, than is commonly believed [12].

Lipophilicity is usually quantified as a partition coefficient (P) and the nature of solute partitioning between immiscible solvents has been understood for many years [13]. The distribution coefficient, D, of compound X may be defined as the ratio of concentrations of X in solvents S1 and S2, where [Xi](S1) and [Xi](S2) are the concentrations of form i of the compound in solvents S1 and S2 respectively:

$$\text{D }=\text{ }({{\Sigma }_{\text{i}}}\left[ {{\text{X}}_{\text{i}}} \right]\left( \text{S1} \right)\text{ })/\text{ }({{\Sigma }_{\text{i}}}\left[ {{\text{X}}_{\text{i}}} \right]\left( \text{S2} \right)\text{ })$$
(1)

The partition coefficient P is usually defined as the ratio of the concentrations of the neutral form of X in the two solvents:

$$\text{P }=\text{ }\left[ {{\text{X}}_{\text{neutral}}} \right]\left( \text{S1} \right)/\left[ {{\text{X}}_{\text{neutral}}} \right]\left( \text{S2} \right)$$
(2)

Partition coefficients in drug discovery are conventionally defined with S1 as the organic solvent and S2 as water which means that the partitioning system may be specified by the organic solvent (e.g., Poct for octanol/water; Pchx for cyclohexane/water; Phxd for hexadecane/water). Partition coefficients are usually quoted as their base 10 logarithms and, in this study, we will use the abbreviations ‘logP’ and ‘logD’ for the base 10 logarithms of P and D with subscripts to indicate the organic phase (e.g., logPoct). The most commonly used organic solvent for lipophilicity measurement is octanol [14, 15] and a number of methods exist for prediction of logPoct [16]. The aqueous phase is typically buffered (e.g., pH 7.4) for lipophilicity measurements and it is D (as opposed to P) that is actually measured. The distribution coefficient, which is a function of pH, and P are identical for compounds that are not significantly ionized at the measurement pH. Making the assumption that only neutral forms of compounds partition into the organic phase, D can be written as a function of P and the fraction, Fneut, of compound present as neutral form in the aqueous phase [17]:

$$\text{log D}\left( \text{pH} \right)\text{ = log P + log }{{\text{F}}_{\text{neut}}}\left( \text{pH} \right)$$
(3)

In some cases [18, 19], ionized forms of compounds do partition into the organic phase and, in these situations, D also depends on the nature and concentration of counter ion(s). If required, logP can be obtained from the logD-pH profile or by applying Eq. (3) with a measured pKa value. However, neither of these approaches is routinely used in drug discovery programs and the logP values quoted for compounds that are significantly ionized are usually calculated rather than measured.
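Equation (3) can be illustrated with a short Python sketch for a monoprotic acid or base. It assumes that only the neutral form partitions into the organic phase and that Fneut follows the Henderson–Hasselbalch relationship. The function names and the example values (logP of 3.0, pKa of 4.4) are hypothetical and are not taken from this study.

```python
import math


def fraction_neutral(pka: float, ph: float, is_acid: bool) -> float:
    """Neutral fraction of a monoprotic acid or base (Henderson-Hasselbalch)."""
    # Acids ionize as pH rises above pKa; bases ionize as pH falls below pKa.
    exponent = (ph - pka) if is_acid else (pka - ph)
    return 1.0 / (1.0 + 10.0 ** exponent)


def log_d(log_p: float, pka: float, ph: float, is_acid: bool) -> float:
    """Eq. (3): logD(pH) = logP + log Fneut(pH)."""
    return log_p + math.log10(fraction_neutral(pka, ph, is_acid))


def log_p_from_log_d(log_d_value: float, pka: float, ph: float, is_acid: bool) -> float:
    """Eq. (3) rearranged to recover logP from logD measured at a single pH."""
    return log_d_value - math.log10(fraction_neutral(pka, ph, is_acid))


# Hypothetical carboxylic acid: logP = 3.0, pKa = 4.4, measured at pH 7.4.
example_log_d = log_d(3.0, 4.4, 7.4, is_acid=True)
```

For this hypothetical acid the neutral fraction at pH 7.4 is about 0.1%, so logD lies roughly three units below logP. The sketch does not cover polyprotic compounds or partitioning of ionized forms.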

Like molecular size, lipophilicity can be regarded as a risk factor in drug discovery and the most direct way to monitor it during the course of a lead optimization project is to plot the response of potency to lipophilicity [20]. Provided that it is not simply a reflection of a narrow range in the data, a weak correlation between potency and lipophilicity is actually desirable because it indicates that the discovery project team has room to maneuver. When potency and lipophilicity are more strongly correlated, the response of the former to the latter should be as steep as possible and this consideration can also be used to assess different structural series within a project. It is also useful to model the response of pIC50 to logP (or logD) because this allows potency to be ‘normalized’ with respect to risk factor and the residuals quantify the extent to which the activity of a compound beats (or is beaten by) the trend in the data [20]. Andrews et al. used residuals in an analogous manner in their 1984 study of functional group contributions to drug-receptor interactions [21]. Subtraction of logD [22] or logP [23] from pIC50 was suggested for normalization of activity with respect to lipophilicity and the difference between pIC50 and logP (or logD) subsequently became known as ligand lipophilicity efficiency or lipophilic ligand efficiency (LLE) and lipophilic efficiency (LiPE) [24]. The difference between potency and logP can be interpreted as a measure of the ease of transferring the neutral form of a compound from an organic solvent (usually octanol) to its binding site although this interpretation is no longer valid when compounds bind to targets in ionized forms [20]. LLE/LiPE will appear to decrease with lipophilicity if the gradient of a linear response of potency to lipophilicity is less than unity and this should generally be considered as a characteristic of the structural series rather than interpreted in terms of ‘quality’ of individual compounds.
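The two ways of normalizing potency described above (subtracting logP to give LLE, and taking residuals from a fitted line) can be compared with a minimal Python sketch. The least squares fit is a generic one and the four data points are hypothetical, chosen so that potency responds to lipophilicity with a gradient of 0.5.

```python
def lle(pic50: float, logp: float) -> float:
    """Ligand lipophilicity efficiency: pIC50 minus logP (or logD)."""
    return pic50 - logp


def fit_line(x: list, y: list) -> tuple:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x: list, y: list) -> list:
    """Extent to which each compound beats (positive) or is beaten by the trend."""
    slope, intercept = fit_line(x, y)
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Hypothetical series in which pIC50 rises by 0.5 per logP unit.
logp_values = [1.0, 2.0, 3.0, 4.0]
pic50_values = [5.0, 5.5, 6.0, 6.5]

slope, intercept = fit_line(logp_values, pic50_values)
lle_values = [lle(p, l) for p, l in zip(pic50_values, logp_values)]
residual_values = residuals(logp_values, pic50_values)
```

In this constructed series every compound lies exactly on the trend line, so all residuals are zero, yet LLE falls from 4.0 to 2.5 as logP increases. This is the behavior expected when the gradient is less than unity and it reflects the series rather than the individual compounds.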

The octanol/water partitioning system is arbitrary and it has been suggested [17] that its adoption may reflect misinterpretation of work by Collander [25] who was aware of the relevance of the hydrogen bonding characteristics of the organic phase to partitioning. Octanol can form hydrogen bonds with solutes on account of the hydroxyl group in its molecular structure and its high water content at saturation (2.5 M; equivalent to mole fraction of 0.29) [26] is greater than that of cyclohexane (0.003 M) [27] or hexadecane (0.002 M) [28]. It has been argued [29, 30] that a hydrocarbon solvent is a more appropriate model for the lipid bilayer core. The alkane/water partition coefficient (logPalk) provides a more direct measure of aqueous solvation energy [17, 31–34] than its octanol/water counterpart (logPoct) while being more amenable to measurement than gas to water transfer free energy [35]. It has also been suggested that a solvent lacking hydrogen bonding capacity would represent the most appropriate reference state for normalizing potency with respect to lipophilicity [20]. Alkane/water partitioning systems are also more sensitive than octanol/water to changes in polarity resulting from conformational biasing and intramolecular hydrogen bonding [36]. Cyclohexane [37], and other hydrocarbon solvents such as hexadecane [28, 29] have been used for logP measurement for many years [38–59]. The difference between logPoct and logPalk provides a measure of solute hydrogen bonding capacity and is of considerable interest in its own right [39, 41, 44–46, 51, 53, 54]. It is usually given the symbol ΔlogP and it is effectively an octanol/alkane partition coefficient where both phases are saturated with water.

Although cyclohexane and hexadecane are the most commonly encountered organic solvents in alkane/water partitioning studies, other hydrocarbon solvents are also used and it is typically necessary to aggregate measurements for different alkanes (and different experimental protocols) for modelling studies [17]. The term ‘alkane/water partition coefficient’ (logPalk) is used both as a generic description of measurements made using very similar partitioning systems and to acknowledge that data has been aggregated for analysis. Compounds of interest to medicinal chemists tend to be poorly soluble in saturated hydrocarbons and this presents challenges for measurement of alkane/water partition coefficients. Self-association is more of a concern for logPalk measurement than for logPoct. Just as ionization in the aqueous phase makes compounds appear to be less lipophilic than they actually are, self-association in the organic phase effectively masks polarity and results in an increase in apparent lipophilicity. Furthermore, differences in spectral characteristics (e.g. dimer absorbs more strongly than monomer) have the potential to exaggerate effects of self-association. However, partitioning of ions into the organic phase is less likely for hydrocarbon solvents than for octanol and the low solubility of water in the former reduces the likelihood of interactions with other solutes that can lead to ‘water-dragging’ [56]. Measurement [34, 48, 57–59] and prediction [17, 33, 49–52, 60–62] of logPalk are both areas of active research.

The presence of hydrogen bond (HB) acceptors and donors in the molecular structure of a solute favors aqueous solvation and tends to make the solute less lipophilic. The less polar the organic phase, the greater the sensitivity of logP to solute hydrogen bonding capacity although it should be noted that contact between polar and non-polar molecular surfaces is not inherently repulsive [63]. HB acidity and basicity are usually quantified as association constants for 1:1 complexes in low-polarity solvents such as carbon tetrachloride or 1,1,1-trichloroethane and a large body of measured data (mainly HB basicity) is available [64–68]. Calculated molecular electrostatic potential (MEP) is an effective predictor of both HB acidity [69] and HB basicity [63, 70–72]. Minimized MEP (Vmin) reflects the electronic distribution within atoms and is arguably more relevant to intermolecular interactions than atomic charges which describe the electronic distribution between atoms [63]. MEP minima cannot, in general, be reproduced by atom-centered (or bond-centered) multipoles [63]. Vmin can be thought of as a ‘lone pair’ descriptor that is capable of explaining why pyrazine can accept a hydrogen bond despite lacking a permanent dipole moment. When relating measured HB acidity/basicity (1:1 complex) to solvation behavior, it is important to be aware that HB donors and acceptors of solute interact simultaneously with a number of solvent molecules (1:N complex) [63, 68]. HB acidity/basicity measured for a polyfunctional compound with non-equivalent HB donors/acceptors is not generally meaningful [63, 68] unless individual contributions to the overall formation constant can be determined [73]. Despite these limitations, HB acidity and basicity considerations can provide insight into partitioning phenomena just as partition coefficient measurements can provide insight into the nature and strength of hydrogen bonding. Taken together, formation constants of 1:1 hydrogen bonded complexes and partition coefficients complement views [74, 75] of molecular recognition that are more based on analysis of X-ray crystal structures.

In this perspective we first show how analysis of logPalk measurements can be used to quantify polarity of both compounds and substructures. We then illustrate the connection between polarity defined in this manner and hydrogen bonding by using examples of polar atom types (e.g. HB donors; aromatic nitrogen) and substructures (e.g. aromatic rings).

Computational details

ADD_CENTRE [63] and MEP2HB were created with the OEChem [76] toolkit which was also used with the OESpicoli toolkit [77] to create ClogPalk [17]. Each of the ADD_CENTRE, MEP2HB and ClogPalk programs uses the OpenEye [78] implementation of SMARTS [79, 80] to specify substructures. Source code and documentation for these three programs and READ_GAUSS_FILE are provided as supplementary material.

Molecular structures were encoded as isomeric SMILES [81, 82] strings and Omega [83, 84] was used to generate a single conformation for each. Molecular geometries were energy-minimized in gas phase (MMFF94S) [85] using the Szybki [86] molecular mechanics program. Molecular surface area (MSA) was calculated from atomic coordinates and Bondi [87] radii using ClogPalk with a probe radius of 1.4 Å. Minimized molecular electrostatic potential [63, 70, 71] was calculated with Gaussian 09 [88] using the Hartree–Fock [89], B3LYP [90, 91] or MP2 [89, 92, 93] theoretical models with 6-31G** or 6-311+G** basis sets [94–96]. The ADD_CENTRE software was used to calculate starting coordinates for MEP minimization by placing points on conventional ‘lone pair’ axes at distances that were typically between 1.3 and 1.5 Å from the relevant nucleus. The version of ADD_CENTRE (1.1) used in this study differs from the version (1.0) used previously [63] in that it provides additional functionality to probe π-systems and handle nitroso oxygen. Starting points for MEP minimization with π-systems were generated by placing points on normals to the plane of symmetry that either pass through atomic nuclei or bond centroids at distances in the range 1.5 to 2.0 Å. MEP minima are typically more difficult to locate for aromatic rings than for heteroatoms and ADD_CENTRE has a feature that allows a normal passing through a bond centroid to be rotated around the bond axis. Vmin values were extracted from Gaussian 09 output using READ_GAUSS_FILE and HB basicity (pKBHX) values were calculated for these using MEP2HB which applies models derived in a previous study [63]. For each atom type, Vmin was calculated at the level of theory corresponding to the most predictive model for pKBHX. An updated file of models for prediction of pKBHX from Vmin is provided as supplemental material.

Measured alkane/water partition coefficients were taken from the literature and classified as CHX (cyclohexane), HXD (hexadecane) or ALK (alkane other than cyclohexane or hexadecane) according to the organic solvent. Unless otherwise stated, data in these three categories were aggregated for analysis and a file of 1144 values measured for 812 compounds is provided as supplementary information with links to their respective literature sources. Files of 453 measured HB basicity (pKBHX) values and 63 measured pKa values are also made available as supplementary information. Octanol/water partition coefficients were taken from a published compilation [97] or, in the case of 1,5-naphthyridine, from primary literature [51].

ClogPalk [17] was used to calculate reference logPalk values from MSA. The reference values used in the analyses for heteroaromatic nitrogen and carbonyl oxygen accounted for polarity of benzylic substituents by subtraction of the following correction factors: benzyl (1.07), 3-chlorobenzyl (1.09) or 4-phenylbenzyl (1.78). Correction for the presence of benzylic substituents was only made for one data point in the analysis for aromatic nitrogen and three data points in the analysis for carbonyl oxygen. The version of ClogPalk used in the current study differs from that described previously [17] in the way that SMARTS patterns are matched. Previously, the parameter associated with a SMARTS pattern was only assigned to the atom mapping onto the first atom of the SMARTS string. In the current version (1.1), the parameter associated with a SMARTS pattern is assigned to all atoms that map onto that SMARTS pattern. Updated parameter files for the ClogPalk model that are compatible with the current version of the software are provided as supplemental material.

MUDO [98] was used for Matched Molecular Pair Analysis (MMPA) [99–104] and all statistical analysis was performed with JMP [105]. The predictive models for pKBHX and polarity used in this study (M01 to M16) are provided in Table 1.

Table 1 Models for prediction of polarity and hydrogen bond basicity

Fig. 1 Plot of measured logPalk against molecular surface area for saturated hydrocarbons (grey circles; model M01 in Table 1), saturated alcohols/diols (red diamonds; models M02/M03 in Table 1) and saturated ethers (green diamonds; model M04 in Table 1). Other than saturated carbon, only the atoms defining each functional group are present in molecular structures for this data set

Fig. 2 Plot of polarity against measured HB basicity for structurally prototypical compounds with only a single HB acceptor or two symmetrically equivalent HB acceptors in their molecular structures. Data has been plotted on a per-HB acceptor basis by normalizing both Q and pKBHX by the number (n) of HB acceptor atoms and details for the least squares line of fit (model M05) are given in Table 1

Fig. 3 Minimized Molecular Electrostatic Potential (MEP) for aromatic systems. The Vmin calculations were performed for MMFF94S energy-minimized structures using the MP2/6-311G** protocol and the atomic units (au) are Hartree per elemental charge. a Plot of HB basicity (pKBHX) against minimized MEP (Vmin) for non-fused aromatic compounds lacking conventional HB acceptors (model M06, Table 1). b Plot of polarity, Q, against Vmin for 1-methylpyrrole, methyl-substituted benzenes and chloro-substituted benzenes. The logPalk value for 1-methylpyrrole has been estimated from the logPalk value measured for pyrrole, making the assumption that the effect of N-methylation will be the same for pyrrole as for indole. c Vmin/au values calculated for π-systems of 1-methylpyrrole, o-xylene and chlorobenzene showing the positions and magnitudes of the MEP minima

Fig. 4 Relationship between corrected polarity, Qcorr, and pKBHX calculated for heteroaromatic nitrogen using model M08 (Table 1). A curve (model M12, Table 1) was fit to the filled black circles which correspond to compounds in which heteroaromatic rings are either non-fused or which have a doubly-connected nitrogen atom in each ring (e.g. 1,5-naphthyridine). Halogen-substituted pyridines (red circles) and fused heterocycles (red and green diamonds) are shown for reference but were excluded from the training set. For furan, logPalk was approximated by logPoct (ΔlogP ∼ 0 for weak HB acceptors) and the measured pKBHX value was used

Fig. 5 Predicted logPalk for a selection of heteroaromatic compounds. Substructural polarity (qaromN) values associated with aromatic nitrogen atoms were calculated from Vmin (M12, Table 1). A correction factor (+2.0) has been applied for presence of adjacent hydrogen bond acceptors in 10 and 12 (but not 7) and the uncorrected predictions are shown in parentheses for these compounds

Fig. 6 Relationship between corrected polarity, Qcorr, and pKBHX calculated for carbonyl oxygen using model M09 (Table 1). The curve (model M12, Table 1) was fit to the black circles which correspond to molecular structures with either one carbonyl group (e.g. cyclopentanone) or two symmetrically equivalent carbonyl groups (benzoquinone). The red diamonds correspond to sulfoxides which were not included in the fitting and the pKBHX values for these compounds were predicted using model M10 (Table 1)

Estimation of polarity from measured partition coefficients

The general framework used in this study for relating partition coefficients to hydrogen bond capacity can be summarized as:

$$\text{logP}\left( \text{ref} \right)-\text{logP}\left( \text{expt} \right)=f(\boldsymbol{\alpha},\boldsymbol{\beta})$$
(4)

In this framework, logP(expt) is the logP value measured for a compound and logP(ref) is logP for a physically meaningful reference state which may either be a measured or calculated value. The HB donor and acceptor capacities for the compound are represented by α and β respectively and these are vectors because, in general, molecular structures have multiple HB donors and acceptors. Equation (4) treats HB donors and acceptors as perturbations of the reference state and exploiting Eq. (4) requires that both reference state and function, f, be defined explicitly. Equation (4) can be used either to estimate HB capacity from logP measurements or to predict logP from calculated HB capacity.

The polarity of a compound may be estimated from measured logPalk by making use of the strong linear relationship (M01, Table 1) between logPalk and MSA that is observed for saturated hydrocarbons. The reference state is a hypothetical saturated hydrocarbon with identical MSA to the compound of interest for which logPalk can be calculated reliably using M01 (Table 1). The polarity, Q, of the compound is defined as the difference between the value of logPalk calculated for this reference state and the measured value:

$$\text{Q }=0.0338\times \left( \text{MSA}/\mathrm{\AA}^{\text{2}} \right)-0.284-\text{log}{{\text{P}}_{\text{alk}}}$$
(5)

Q can be treated as a sum of contributions (qi) from polar substructures where ni is the number of instances of substructure i in the molecular structure of the compound:

$$\text{Q }={{\Sigma }_{\text{i}}}{{\text{n}}_{\text{i}}}\times {{\text{q}}_{\text{i}}}$$
(6)

Equations (5) and (6) form the basis of the ClogPalk model [17] which associates qi values with substructures defined using SMARTS [79, 80] notation and is illustrated graphically in Fig. 1. Equations (5) and (6) can either be used with measured logPalk data to estimate qi or with calculated qi values to predict logPalk. A strong correlation between logPalk and molecular volume was also observed for saturated hydrocarbons and analogous analysis based on that relationship has been reported [45]. If measured logPalk values are available for a number of compounds with only the substructure of interest and saturated carbon present in their molecular structures then the mean value of Q provides a direct estimate of qi for that substructure. This is the preferred approach for estimation of substructural polarity from measured logPalk although its applicability may be limited by data availability. Once qi has been determined directly for substructure i (e.g. benzyl substituent), it can then be used to estimate qj from measured logPalk for compounds with only saturated carbon and substructures i and j in their molecular structures. This approach was used in the parameterization of the ClogPalk model [17] and estimation of substructural polarity in this manner may be termed ‘indirect’. When modelling the response of polarity to HB capacity, it can be useful to correct Q for presence of other HB acceptors and donors since this enables exploitation of more measured data than would otherwise be possible. A corrected value of Q may be defined as follows where n is the number of instances of the HB acceptor (or donor) of interest, and qcorr,i and ncorr,i are, respectively the polarity and number of instances of a substructure i:

$${{\text{Q}}_{\text{corr}}}=\text{ }(\text{Q}-{{\Sigma }_{\text{i}}}{{\text{n}}_{\text{corr},\text{i}}}\times {{\text{q}}_{\text{corr},\text{i}~}})/\text{n}$$
(7)

In this study, Qcorr values were used to model the responses of polarity to calculated pKBHX for heteroaromatic nitrogen and carbonyl oxygen although correction factors were only defined for three substructures: benzyl (1.07), 3-chlorobenzyl (1.09) and 4-phenylbenzyl (1.78).
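Equations (5) to (7) can be sketched in Python as follows. The slope and intercept are those of Eq. (5) and the benzyl correction (1.07) is the value quoted in the text. The function names, the MSA of 200 Å² and the logPalk of 1.0 are hypothetical and serve only to show the arithmetic.

```python
SLOPE = 0.0338      # per square angstrom, from Eq. (5)
INTERCEPT = -0.284  # from Eq. (5)


def reference_logp_alk(msa: float) -> float:
    """logPalk of a hypothetical saturated hydrocarbon with the same MSA."""
    return SLOPE * msa + INTERCEPT


def polarity(msa: float, logp_alk: float) -> float:
    """Eq. (5): Q is the reference logPalk minus the measured logPalk."""
    return reference_logp_alk(msa) - logp_alk


def corrected_polarity(q: float, n: int, corrections: list) -> float:
    """Eq. (7): remove (count, q_corr) contributions, then scale by n."""
    return (q - sum(count * q_corr for count, q_corr in corrections)) / n


def predict_logp_alk(msa: float, contributions: list) -> float:
    """Eqs. (5) and (6) rearranged: reference logPalk minus the sum of n_i * q_i."""
    return reference_logp_alk(msa) - sum(count * q_i for count, q_i in contributions)


# Hypothetical compound: MSA = 200 square angstroms, measured logPalk = 1.0,
# one HB acceptor of interest and one benzyl substituent (q_corr = 1.07).
q_example = polarity(200.0, 1.0)
q_corr_example = corrected_polarity(q_example, n=1, corrections=[(1, 1.07)])
```

For these illustrative inputs the reference logPalk is 6.476, giving Q of about 5.48 and Qcorr of about 4.41. The same two equations run in reverse in `predict_logp_alk`, which is the direction used when qi values are already known.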

MMPA [99–104] can be used to estimate polarity differences between substructures. A matched molecular pair consists of two compounds that are linked by a specific structural transformation (e.g. carboxyl to tetrazole) that may be regarded as a perturbation of either structure. For example, the effect on logPalk of N-methylation of a secondary amide group may be estimated by averaging the difference in logPalk between secondary amides and their N-methylated analogs:

$$\Delta \text{log}{{\text{P}}_{\text{alk}}}\left[ \text{Amide}:\text{NH}\to \text{Amide}:\text{NMe} \right]=\text{log}{{\text{P}}_{\text{alk}}}\left[ \text{R1C(=O)N}\left( \text{Me} \right)\text{R2} \right]-\text{log}{{\text{P}}_{\text{alk}}}\left[ \text{R1C(=O)N}\left( \text{H} \right)\text{R2} \right]$$
(8)

In general, the structural transformations that define matched molecular pairs result in changes in MSA and this must be accounted for when using MMPA to estimate polarity differences between substructures. For example, the difference in the polarity of substructures 1 and 2 can be written as:

$${{\text{q}}_{1}}-{{\text{q}}_{2}}=\Delta \text{log}{{\text{P}}_{\text{alk}}}[1\to 2]-\left( 0.0338/\mathrm{\AA}^{2} \right)\times \Delta \text{MSA}[1\to 2]$$
(9)

The advantage of MMPA is that it allows measured data for compounds with non-equivalent HB donors and acceptors in their molecular structures to be exploited for estimation of polarity.
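The use of Eq. (9) with matched molecular pairs can be sketched in Python as shown below. Differences are taken as the value for structure 2 minus the value for structure 1, following the convention of Eq. (8). The function names and the two pairs of logPalk and MSA values are hypothetical.

```python
SLOPE = 0.0338  # per square angstrom, from Eq. (5)


def polarity_difference(delta_logp_alk: float, delta_msa: float) -> float:
    """Eq. (9): q1 - q2 from the changes in logPalk and MSA for 1 -> 2."""
    return delta_logp_alk - SLOPE * delta_msa


def mean_polarity_difference(pairs: list) -> float:
    """Average q1 - q2 over matched pairs.

    Each pair is (logPalk_1, logPalk_2, MSA_1, MSA_2) for the two structures
    linked by the transformation 1 -> 2.
    """
    differences = [
        polarity_difference(logp_2 - logp_1, msa_2 - msa_1)
        for logp_1, logp_2, msa_1, msa_2 in pairs
    ]
    return sum(differences) / len(differences)


# Two hypothetical matched pairs, each gaining 20 square angstroms of MSA.
example_pairs = [
    (0.0, 1.0, 150.0, 170.0),
    (0.5, 1.3, 160.0, 180.0),
]
example_mean = mean_polarity_difference(example_pairs)
```

In this made-up example the increase in MSA alone would raise logPalk by 0.676, so the observed increases of 1.0 and 0.8 imply that substructure 1 is more polar than substructure 2 by about 0.22 on average.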

One advantage of defining polarity in terms of a difference between partition coefficients is that Q is invariant with respect to standard state. Partition coefficients are usually defined in terms of molar concentration units although mole fraction can also be used. Any model for partitioning (or binding) must be able to accommodate a change in standard state definition in order to be considered to have a valid thermodynamic basis. While prediction of partition coefficients is the main focus of this Perspective, measures of substructural polarity derived from logPalk are also of interest for modelling molecular recognition in aqueous media [55]. One of the objectives of this study is to evaluate calculated HB basicity as a predictor of substructural polarity and it is instructive to examine the relationship between Q and measured pKBHX that is illustrated in Fig. 2. The compounds in this data set were selected to have either a single HB acceptor (e.g. cyclohexanone) or two equivalent HB acceptors (e.g. dioxane) which means that Q can be associated with the HB acceptor of each compound. The results shown in Fig. 2 suggest that development of a model for logPalk that is based entirely (i.e. without substructural parameterization) on measures of HB acidity and basicity derived from formation constants of 1:1 hydrogen bonded complexes is unlikely to be feasible.
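The invariance of Q with respect to standard state can be made explicit with a short worked relation. It assumes dilute solutions, and the symbols for the molar volumes of the two mutually saturated phases are introduced here for illustration only. Switching from molar concentration (superscript c) to mole fraction (superscript x) shifts every partition coefficient measured in a given solvent system by the same constant:

$$\text{log}{{\text{P}}^{(\text{x})}}=\text{log}{{\text{P}}^{(\text{c})}}+\text{log}\left( {{\text{V}}_{\text{alk}}}/{{\text{V}}_{\text{water}}} \right)$$

Because the constant is added to both logP(ref) and logP(expt), it cancels in the difference that defines polarity:

$${{\text{Q}}^{(\text{x})}}=\text{logP}{{\left( \text{ref} \right)}^{(\text{x})}}-\text{logP}{{\left( \text{expt} \right)}^{(\text{x})}}=\text{logP}{{\left( \text{ref} \right)}^{(\text{c})}}-\text{logP}{{\left( \text{expt} \right)}^{(\text{c})}}={{\text{Q}}^{(\text{c})}}$$

The cancellation relies on the reference and measured values being expressed in the same partitioning system and the same units.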

Polarity of hydrogen bond donors

The octanol/water system is relatively insensitive to the presence of HB donors in molecular structures and logPoct is of practically no value in assessing HB acidity [12, 46, 54]. Consequently, it is necessary to use alkane/water systems to study hydrogen bond donors with partition coefficient measurements. Polarity estimates for a number of common HB donors are presented in Table 2. Generally, the presence of an HB donor in a molecular structure implies that at least one HB acceptor is also present and this means that the HB donor contribution to polarity cannot be estimated directly using equations (5) and (6). Most of the values in Table 2 were derived from MMPA using Eq. (9). Availability of data made it possible to estimate polarity for hydroxyl, thiol and carboxylic acid HB donors by using equations (5) and (6) indirectly (e.g. as polarity difference between alcohols and ethers). Polarity was also estimated for the primary sulfonamide HB donors using equations (5) and (6) although this reflects lack of data for matched molecular pairs. One question that arises from this analysis concerns the extent to which alkylation of nitrogen or oxygen perturbs HB basicity although it is likely that donation of an HB to water will affect HB basicity in a similar manner.

Table 2 Polarity of hydrogen bond donors (HBD)

The polarity estimates in Table 2 suggest that hydrogen atoms interact more strongly with water when bonded to oxygen than when bonded to nitrogen. This is broadly consistent with logKα values typically observed [66] for amides, phenols and carboxylic acids although it is important to be aware that HB donation by hydroxyl is likely to result in an increase in HB basicity of oxygen [51]. The interactions of the HB donors of benzamides and anilides with water appear particularly weak, suggesting that methylation of these nitrogen atoms favors conformations in which the amide carbonyl oxygen atom can form more effective interactions with water. Analogous observations have been made for chromatographically-measured lipophilicity [106].

Although polarity differences can be discerned between the different types of HB donor, it is more instructive to compare them with polarity estimates for compounds with a single HB acceptor nitrogen or oxygen. The values of Q (in parentheses) for acetonitrile (3.5), 1-methylimidazole (5.5), 1-methylpiperidine (3.8), tetrahydrofuran (3.1), acetone (3.8), dimethylformamide (5.7), N-acetylpyrrolidine (6.8) and dimethylsulfoxide (7.0) suggest that the HB acceptors in these compounds are typically more polar than any of the HB donors in Table 2. Defining polarity in terms of logPalk enables HB donors and acceptors to be brought onto the same scale in a way that is not possible with measures of HB acidity and basicity derived from association constants for 1:1 hydrogen bonded complexes. These observations point to a general tendency for water to interact more strongly with HB acceptors than with HB donors and are consistent with the view that anions interact more strongly than cations with water [107–109]. One question raised by the hydration imbalance between HB donors and acceptors concerns the extent to which it can be explained by the molecular (as opposed to the solvent) structure of water. The hydration imbalance between the HB donor and acceptor of the amide group should be considered when modelling protein folding and intramolecular hydrogen bonding of cyclic peptides.

Aromatic π-systems

Unlike other substructures used as illustrative examples, the HB capacity of π-systems cannot be linked to individual atoms. Aromatic hydrocarbons are more polar than saturated hydrocarbons and water is an order of magnitude more soluble in benzene than cyclohexane at temperatures ranging from 10 to 40 °C [27]. A Q value of 1.0 can be calculated for benzene using equations (5) and (6), indicating polarity comparable with the HB donor of an amide. An increase in the extent of the π-system typically leads to an increase in polarity although the Q values for phenanthrene (1.5) and pyrene (1.4) suggest that the trend is not particularly strong. The Q value for N-methylindole (2.5) indicates that this heterocycle is particularly polar and this is a factor that may need to be specifically accounted for when modelling interactions of tryptophan residues. The π-systems of aromatic rings function as HB acceptors and pKBHX values have been measured [68] for benzene (−0.49) and 1-methylpyrrole (0.23). Figure 3a illustrates the relationship (M06, Table 1) between pKBHX and Vmin which can be used to predict pKBHX for the aromatic rings of chlorobenzene (−1.0) and 1,3-dichlorobenzene (−1.6). While it is well-established that aromatic π-systems can interact with HB donors, the key question in pharmaceutical design is whether the π-system of an aromatic ring interacts more or less strongly with its binding partner than with water.

The Vmin values associated with π-systems provide a measure of potential for interaction with HB donors and could be used as physicochemical descriptors of aromatic character [110]. Two pairs of MEP minima were observed for the π-system of indole and these are associated with the C4–C5 bond (Vmin = −0.035 au; calculated pKBHX = −0.06) and the C2–C3 bond (Vmin = −0.032 au; calculated pKBHX = −0.18). The MEP minima associated with the C4–C5 bond lie closer to C5 than C4 and it is significant that 5-azaindole is the most basic of the azaindoles [111]. MEP calculations can be used to compare the effects of substitution and ring-fusion. For example, the Vmin value (−0.0005 au; predicted pKBHX = −2.15) calculated for buckminsterfullerene suggests that its π-system accepts hydrogen bonds even less readily (on a per-bond basis) than 1,3,5-trichlorobenzene (−0.0021 au; calculated pKBHX = −2.01). Two challenges for using pKBHX (or Vmin) to model aqueous solvation of π-systems are that numbers of interacting water molecules are not generally known and that HB basicity derived from data for 1:1 complexes is not directly relevant when a π-system accepts more than a single HB.

A plot of Q against Vmin is shown in Fig. 3b for a selection of non-fused aromatic compounds and a line (M07, Table 1) has been fit to the data for 1-methylpyrrole, benzene and the methylated benzenes. The chlorinated benzenes all lie above the reference line indicating that they are more polar than would be expected from Vmin values calculated for their π-systems. These results are consistent with a view that some of the lipophilicity increase associated with chloro-substitution is the result of a reduction in the HB basicity of the ring which would imply that chloro substituents on aromatic rings are less lipophilic than is commonly assumed [112, 113]. Additional support for this view comes from MMPA [99–104] which shows that replacement of chloro with methyl for primary alkyl chlorides leads, on average, to a 1.4 unit increase in logPhxd (Table 3). In contrast, replacement of a chloro substituent on a benzene ring with a methyl group tends to result in a small decrease in logPhxd. MEP calculations suggest that the chlorine atoms of chlorobenzene (Vmin = −0.019 au) and dichloromethane (Vmin = −0.020 au) are of similar polarity. A single MEP minimum (Vmin = −0.024 au) was found for 1,2-dichlorobenzene and this indicates that, in contrast with dichloromethane, through-space interactions between the chlorine atoms are more important than through-bond interactions.

Table 3 Matched molecular pair analysis of the effect on hexadecane/water logP of replacing chloro with methyl

Aromatic nitrogen

Aromatic nitrogen is an important molecular recognition element in medicinal chemistry and the pKBHX and logKβ values measured [66–68] for it span a wide range, indicating that this atom type is relatively sensitive to substructural context. This makes it more difficult to parameterize polarity by substructure and therefore increases the potential impact of a polarity model based on MEP. The relationship between Qcorr and calculated pKBHX (Model M08, Table 1) is shown in Fig. 4 for compounds with aromatic nitrogen HB acceptors. In this analysis, a substructural correction (for benzyl) was applied for a single measured logPalk value although two other values of Qcorr reflect scaling by the number of heteroaromatic nitrogen atoms. For modelling, the dataset has been restricted to molecular structures with one or more nitrogen atoms present in each aromatic ring and that are either unsubstituted or alkyl-substituted (e.g. 4-methylpyridine and 1,5-naphthyridine but not quinoxaline). The underlying assumption is that the polarity of an aza-substituted aromatic ring is dominated by the nitrogen so that the contribution of the π-cloud may be neglected. Making this assumption allows Qcorr to be equated to the substructural polarity, qaromN, of aromatic nitrogen for the training set compounds. 1-Benzylimidazole was included in the training set because the contribution to polarity of the benzyl group can be corrected for. An exponential function (Model M12, Table 1) was fitted to the data which allows qaromN to be calculated from Vmin. The rationale for fitting an exponential function is that the contribution of an HB acceptor to logPalk tends asymptotically to zero as the HB basicity becomes very weak. Values of Qcorr were also plotted in Fig. 4 for a number of compounds that had been excluded from the training set because of uncertainty about the contributions to polarity from substructures other than aromatic nitrogen. Fused five-membered heteroaromatic rings all lie above the fitted curve, indicating that other factors (e.g. presence of oxygen in ring; π-cloud polarity of carbocyclic ring) need to be considered when interpreting polarity for these compounds. The data for quinoline, isoquinoline and quinoxaline were not used for fitting M12 (Table 1), on account of the carbocyclic rings in their molecular structures. However, all lie close to the fitted curve which suggests that the carbocyclic rings of these compounds make only small contributions to polarity. The pKBHX values calculated (Model M06, Table 1) for the carbocyclic rings of 1-methylbenzimidazole (−0.2), quinoline (−0.8), isoquinoline (−1.2) and quinoxaline (−1.3) may explain why the largest positive residual was observed for the first compound. Positive residuals were also observed for the halogenated species and this suggests that the polarity of the halogen atoms cannot be neglected. The pKBHX values calculated for the nitrogen (0.4) and each fluorine (−0.6) atom of 2,6-difluoropyridine suggest that the fluoro substituents significantly influence the polarity of this compound.
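
The rationale for the exponential form can be made explicit. One functional form with the stated limiting behaviour is written below in terms of calculated pKBHX; the parameters a and b are illustrative symbols and are not the fitted M12 values in Table 1.

$$q_{\text{aromN}} = a\,\exp\left(b\,\text{p}K_{\text{BHX}}\right), \qquad a > 0,\; b > 0, \qquad \lim_{\text{p}K_{\text{BHX}} \to -\infty} q_{\text{aromN}} = 0$$

A straight line would instead cross zero at some finite pKBHX and assign negative polarity to weaker acceptors, which has no physical meaning. Because pKBHX is itself calculated as a linear function of Vmin, a form of this kind is also exponential in Vmin.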

Equations (5) and (6) were used with calculated qaromN (M12, Table 1) to predict logPalk for a number of compounds for which the only substructures with HB capacity were aromatic nitrogen atoms (Fig. 5). Predicted and measured logPalk values were compared for five compounds with two or more non-equivalent heteroaromatic HB acceptors. The largest discrepancies between measurement and prediction were observed for 2 and 4 and, in each of these cases, the predicted value is less than the measured value which indicates that HB acceptor capacity has been over-estimated in the context of alkane/water partitioning. It is well known [63, 66–68, 71, 114] that heteroaromatic compounds such as pyridazine (12) with adjacent nitrogen atoms are better HB acceptors than their proton basicity would suggest and this can be considered as a manifestation of the α effect [115] or thought of in terms of secondary electrostatic interactions [116]. While only 1:1 complexes are typically observed in the measurement of HB acidity or basicity, the HB donors and acceptors present in a molecular structure can all simultaneously form hydrogen bonds with water molecules in aqueous solution. HB donation to one of the nitrogen atoms of pyridazine would be expected to make it more difficult for the other nitrogen atom to accept an HB for a number of reasons. Firstly, accepting an HB makes nitrogen more electronegative and this will tend to draw electron density away from the other nitrogen atom. Secondly, simultaneous HB donation to both nitrogen atoms of pyridazine would result in an electrostatically repulsive orientation of water molecules that is enthalpically unfavorable. Thirdly, the orientation of two water molecules would increase the degree of constraint in the system and is therefore expected to be entropically unfavorable. It is noteworthy that the logPalk value calculated for 1 is very similar to the measured value and this observation is consistent with N3, which is predicted to be a significantly stronger HB acceptor than N2, dominating the solvation of this triazole.

The view that adjacency of HB acceptors compromises solvation has implications for molecular design and it can be conjectured that similar considerations apply to adjacent HB donors. The entropic costs of solvating adjacent polar atoms can also be thought of in terms of molecular complexity [117] and solvation can be described as ‘frustrated’ [118] when hydration spheres of polar atoms overlap to a significant extent. This implies that the presence of adjacent HB donors or acceptors in a concave region of a protein molecular surface should be viewed as a design opportunity [119]. It has also been suggested that molecular structures capable of presenting arrangements of hydrogen bonding groups that cannot easily be mimicked by clusters of water molecules represent a molecular recognition theme [63] that might be exploited in fragment design [120]. Measurement of logPalk for structurally prototypical compounds would allow frustrated hydration to be studied systematically.

Predictions for a number of heteroaromatic compounds for which experimental values have not been reported are also presented in Fig. 5 and the values calculated for 10 and 12 (but not 7) have been corrected (+2.0) for the presence of adjacent HB acceptors. The HB acceptors of 9 and 15 are predicted to be the weakest for the structures shown in Fig. 5 and measured logPalk values for these would be particularly informative for refining the model illustrated in Fig. 4. The cyclohexane/water partition coefficient component of the SAMPL5 challenge [33, 34] features compounds of higher molecular complexity than the structurally prototypical compounds typically encountered in the logPalk literature and this is certainly appropriate for testing prediction methods. Nevertheless, a case can be made for inclusion of structural prototypes that are likely to present specific challenges for solvation models. The prediction difficulties presented by strong HB acceptors that are aligned point to compounds of potential interest in initiatives like SAMPL5 [33, 34] and the HB acceptor characteristics of 1,8-naphthyridine and 1,2,3-triazine have already been highlighted in this context [63]. Prediction in drug design frequently focuses [99–104] on differences between values of properties (e.g. decrease in solubility resulting from chloro-substitution) and this is a theme that could be explored in challenges such as SAMPL5 [33, 34]. For example, measured logPalk for pairs of compounds of identical molecular shape, but differing in their hydrogen bonding characteristics (e.g. 1-butyltetrazole and 2-butyltetrazole), would enable comparison of different solvation models with respect to their treatment of electrostatics.
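
The arithmetic behind these predictions can be summarised in a few lines. The sketch below assumes the scheme described in the caption to Fig. 7 (a hydrocarbon reference value minus the sum of substructural polarities) and reads the +2.0 adjustment as a term added to predicted logPalk. The function name, argument names and the numerical inputs are hypothetical and are not data from Fig. 5.

```python
def predict_logp_alk(reference_logp, polarities, adjacency_correction=0.0):
    """Subtract substructural polarities from a hydrocarbon reference value.

    reference_logp: logPalk estimated for a saturated hydrocarbon of the
        same molecular surface area (model M01 in the article).
    polarities: one q value per polar substructure (e.g. qaromN per nitrogen).
    adjacency_correction: amount added back when HB acceptors are adjacent.
    """
    return reference_logp - sum(polarities) + adjacency_correction


# Hypothetical inputs chosen only to show the arithmetic; not data from Fig. 5.
reference = 3.0
q_values = [2.5, 2.5]

uncorrected = predict_logp_alk(reference, q_values)
corrected = predict_logp_alk(reference, q_values, adjacency_correction=2.0)
print(uncorrected, corrected)
```

Keeping the correction as a separate argument makes it easy to report both values, which is useful when the size of the adjacency effect is itself uncertain.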

Carbonyl oxygen

As is the case for aromatic nitrogen, the HB basicity of carbonyl oxygen is very sensitive to substructural context and it is therefore difficult to parameterize polarity for this atom type using substructural definitions. Oxygen atoms are typically associated with two HB acceptor sites that are not in general equivalent although this does not present special difficulties for modelling HB basicity because the experiments are designed so that only 1:1 complexes are observed [67, 68]. The situation is very different in solvents with HB donor capacity because an oxygen atom can simultaneously accept two hydrogen bonds and using one HB acceptor site is likely to result in a decrease in the HB basicity of the remaining site [51]. The situation is analogous to that of aligned HB acceptors of 2 and 4 discussed in the previous section, although each HB acceptor site is likely to be even more sensitive to the environment of the other. The approach used in this study was to model polarity using the greater of the two pKBHX values predicted for each carbonyl oxygen atom in cases where the two values differ and HB basicity of carbonyl oxygen has been treated in an analogous manner for prediction of ΔlogP [51]. As was the case for aromatic nitrogen, the training set was restricted to compounds for which the polarity of the carbonyl oxygen could be estimated from measured logPalk. In cases where the carbonyl group is part of an extended, non-fused, π-system (e.g. tertiary amides and benzoquinone but not naphthoquinone) substructural polarity is assumed to be due to the carbonyl oxygen. Three quinolones with benzylic substituents on nitrogen were also included in the training set because their inclusion improves coverage of chemical space and the polarity of substituents can be accounted for. The relationship between Qcorr and pKBHX predicted using M09 (Table 1) is shown in Fig. 6 for compounds with carbonyl or sulfoxide oxygen as the only atoms with HB capacity in their molecular structures. The data points for the sulfoxides were not used for modelling and all lie below the curve (M13, Table 1) that has been fit which suggests that predicted pKBHX exaggerates the polarity of sulfoxide oxygen.

Equations (5) and (6) were used to predict logPalk using models M12 (aromatic nitrogen), M13 (carbonyl oxygen), M14 (hydroxyl donating intramolecular HB) and the qHBD values in Table 2 (HB donors). The results are shown in Fig. 7 and agreement between predicted and measured values is poorer than might be expected from the root mean square error (RMSE) for model M13 (Table 1) which highlights the difficulties in extrapolating from structural prototypes to situations where HB acceptors and/or donors are in close proximity. The predictions tend to exaggerate the polarity of these compounds and predicted logPalk values are typically lower than the measured values. As noted in the previous section, simultaneous solvation of adjacent hydrogen bonding sites is likely to incur, at the very least, an entropic cost and Vmin does not capture the polarization of a solute that accepts hydrogen bonds from one or more water molecules. The discrepancies between predicted and measured logPalk are particularly extreme for 21, 22, 24 and 25 which may reflect a structural feature (carbonyl group adjacent to doubly-connected nitrogen) that is shared by these compounds. However, a more subtle factor may also be exerting its influence here. The carbonyl oxygen atoms for the compounds in the training set typically have HB acceptor sites for which the calculated pKBHX values are either identical or, at least, very similar. In contrast, the pKBHX values calculated for the HB acceptor sites of 22 (3.4 and 2.2) differ by 1.2. This raises a more general question for quantitative structure activity/property relationship (QSAR/QSPR) modelling. Suppose two descriptors X1 and X2 are strongly correlated for the training set compounds. Should a set of compounds for which X1 and X2 are weakly correlated be considered to be within the same region of chemical space as the training set simply because the values of all descriptors used in the model lie within the ranges of training set values?
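
The question posed above can be illustrated numerically. The sketch below contrasts a per-descriptor range check with a Mahalanobis distance, which is one common way (not a method used in this study) to account for correlation between descriptors. The training pairs are hypothetical values invented to mimic near-identical site pKBHX values; only the query point (3.4, 2.2) comes from the text.

```python
import math

# Hypothetical training set: (pKBHX of site 1, pKBHX of site 2) per carbonyl
# oxygen, with the two sites nearly identical as described in the text.
TRAINING = [(1.0, 1.0), (1.5, 1.4), (2.0, 2.1), (2.5, 2.5), (3.0, 2.9), (3.5, 3.5)]
QUERY = (3.4, 2.2)  # site values quoted in the text for compound 22


def within_ranges(training, point):
    """True if every descriptor lies inside the training-set min/max."""
    return all(
        min(row[i] for row in training) <= value <= max(row[i] for row in training)
        for i, value in enumerate(point)
    )


def mahalanobis(training, point):
    """Distance of a 2-D point from the training mean, scaled by covariance."""
    n = len(training)
    m1 = sum(x for x, _ in training) / n
    m2 = sum(y for _, y in training) / n
    s11 = sum((x - m1) ** 2 for x, _ in training) / (n - 1)
    s22 = sum((y - m2) ** 2 for _, y in training) / (n - 1)
    s12 = sum((x - m1) * (y - m2) for x, y in training) / (n - 1)
    det = s11 * s22 - s12 ** 2
    d1, d2 = point[0] - m1, point[1] - m2
    # Quadratic form d^T S^-1 d written out for a 2x2 covariance matrix.
    squared = (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det
    return math.sqrt(squared)


print(within_ranges(TRAINING, QUERY))
print(max(mahalanobis(TRAINING, row) for row in TRAINING))
print(mahalanobis(TRAINING, QUERY))
```

With these inputs the query passes the range check, yet its Mahalanobis distance is several times larger than that of any training point, because it lies far from the diagonal along which the training descriptors are correlated.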

Fig. 7 Calculated logPalk for compounds with carbonyl oxygen HB acceptors. Substructural polarity values associated with atoms were calculated from Vmin using models M12, M13 or M14 (Table 1) or taken from Table 2. Each logPalk value is calculated by subtracting substructural polarity values from a reference logPalk value calculated (model M01, Table 1) for a saturated hydrocarbon with the same molecular surface area. Hydroxyl groups forming intramolecular hydrogen bonds were modelled (M14, Table 1) as ethers and the participating HB donors were treated as non-polar

Values of logPalk have been calculated for three compounds for which intramolecular hydrogen bonding is likely to influence partitioning characteristics. The calculated logPalk values are all lower (by 0.3 to 1.1 units) than the measured values which indicates that the polarity of the compounds has been over-estimated. Formation of an intramolecular HB eliminates one of the MEP minima associated with carbonyl oxygen and it could be argued that this would place the compound outside the applicability domain of a model trained with data for carbonyl groups with pairs of very similar Vmin values. Nevertheless, the MEP calculations capture essential features of the intramolecular HB such as the reduced availability of the remaining oxygen ‘lone pair’. The intramolecular HBs for these three compounds are likely to persist in the aqueous phase and this would be expected to facilitate prediction of logPalk, especially for a method like ClogPalk that uses a single conformation to represent a structure.

Modeling logPoct for aza analogs of benzene and naphthalene

Although alkane/water partition coefficients represent the main focus of this study, hydrogen bonding also influences their octanol/water equivalents. Figure 8 illustrates the relationship between the effect of aza-substitution on logPoct and the pKBHX calculated for nitrogen. The analysis has been performed on a per-nitrogen basis and the data points fall into two groups according to whether or not a carbocyclic ring is present in the molecular structure of the aza-analog. The small residuals observed for phthalazine and cinnoline suggest that proximity of HB acceptors is much less of a problem for prediction of logPoct than for logPalk. Calculated values of logPoct for aza analogs of benzene and naphthalene are shown in Fig. 9. On a technical note, aza-analogs of benzene and naphthalene should be considered outside the applicability domains of these models if they are substituted (even with alkyl).
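
The per-nitrogen scaling used in this analysis is simple to state in code. The sketch below takes the parent logPoct values from the caption to Fig. 9 and defines the effect as the decrease in logPoct relative to the parent hydrocarbon, divided by the number of aza substitutions; this sign convention and the function name are assumptions. The pyridine value is a commonly quoted literature figure that does not appear in this article.

```python
# Measured logPoct for the parent hydrocarbons, as given in the Fig. 9 caption.
PARENT_LOGP_OCT = {"benzene": 2.13, "naphthalene": 3.30}


def per_nitrogen_effect(parent, aza_logp_oct, n_aza):
    """Decrease in logPoct per aza substitution relative to the parent."""
    if n_aza < 1:
        raise ValueError("n_aza must be at least 1")
    return (PARENT_LOGP_OCT[parent] - aza_logp_oct) / n_aza


# Pyridine logPoct of 0.65 is a commonly quoted literature value and is an
# assumption here; it is not taken from this article.
pyridine_effect = per_nitrogen_effect("benzene", 0.65, n_aza=1)
print(round(pyridine_effect, 2))
```

Dividing by the number of nitrogen atoms puts mono- and poly-aza analogs on a common scale, so that they can be plotted against the pKBHX calculated for a single nitrogen.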

Fig. 8 Relationship between effect on logPoct of aza substitution of benzene or naphthalene and pKBHX calculated for aromatic nitrogen using model M08 (Table 1). Differences in logPoct values were scaled by number of aza substitutions and points are grouped according to whether or not the aza-substituted species is fused with a carbocycle (1,5-naphthyridine is grouped with the aza-benzenes in green). The data was fitted using models M15 (green) and M16 (red) in Table 1

Fig. 9 Prediction of logPoct for some six-membered heteroaromatic rings. Substructural polarity values (qaromN) associated with nitrogen atoms were calculated from Vmin using model M12 (Table 1). Each logPoct value is calculated by subtracting qaromN values from measured logPoct for either benzene (2.13) or naphthalene (3.30)

One aspect of lipophilicity control in molecular design is to achieve a balance between the polar and non-polar portions of molecular structures. The logPoct values for benzene (2.1) and 4-propylpyridine (2.1) suggest that aza-substitution of benzene will counter the effect of a propyl substituent. However, in the hexadecane/water partitioning system, 4-propylpyridine (logPhxd = 1.3) is 0.8 units less lipophilic than benzene (logPhxd = 2.1) suggesting that aza-substitution will more than compensate for the presence of a propyl group. Differences like these raise the question of which partitioning system is ‘right’ for lead optimization and even whether there is a single ‘right’ partitioning system for all applications. Despite its limitations, logPoct is likely to remain a useful design parameter for lead optimization and knowledge of HB acidity and basicity can help the medicinal chemist minimize the impact of the limitations. Lead optimization is usually carried out against structural series that are defined by scaffolds and HB acceptors/donors (and ionizable groups) tend to be relatively conserved within series. This means that the choice of partitioning system becomes less important when working within a series than when performing data analysis for structurally diverse sets of compounds [20]. If a plot of pIC50 against logPoct shows a compound to be deviating sharply from the trend line, it is advisable to assess the hydrogen bonding characteristics of the compound before jumping to the conclusion that the observed potency is especially unusual. The medicinal chemist should also be cautious when attempting to extrapolate trends (e.g. response of aqueous solubility to logPoct) observed for one series to another series and be especially wary of any analysis in which continuous data has been transformed to categorical data [12].
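
The benzene and 4-propylpyridine comparison above comes down to a difference between two partitioning systems. The short sketch below reproduces that arithmetic from the four logP values quoted in the text; the dictionary layout and function name are illustrative.

```python
# logP values quoted in the text: (octanol/water, hexadecane/water).
LOGP = {
    "benzene": (2.1, 2.1),
    "4-propylpyridine": (2.1, 1.3),
}


def delta_logp(name):
    """Return logPoct - logPhxd for a named compound."""
    logp_oct, logp_hxd = LOGP[name]
    return logp_oct - logp_hxd


for name in LOGP:
    print(f"{name}: logPoct - logPhxd = {delta_logp(name):.1f}")
```

The difference is zero for benzene and 0.8 for 4-propylpyridine, which shows how the two compounds look equally lipophilic in octanol/water while the hexadecane/water system separates them.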

Conclusions

We show how logPalk measurements can be analyzed to define polarity for both compounds and substructures. Using a number of illustrative examples, we make a connection between polarity defined in terms of partitioning and hydrogen bonding defined in terms of 1:1 complex stability. Defining polarity in this way highlights the hydration imbalance between the HB donor and acceptor of the amide group. Two insights relevant to molecular design are that aromatic chloro substituents may be less hydrophobic than is commonly believed and that hydration of adjacent HB acceptors (or donors) is likely to be frustrated. We show how pKBHX values calculated for aromatic nitrogen and carbonyl oxygen can be used in prediction of partition coefficients.