Introduction

Hydration free energies have been of substantial interest to the molecular simulations and computer-aided drug discovery communities for many years. These free energies describe the transfer of small molecules from gas to water, or equivalently their relative populations in gas and water at equilibrium. This interest stems from both practical and scientific reasons. Water is of considerable interest as a solvent, and these free energies can be used to probe aspects of solvation we do not yet understand [2, 6, 9, 10, 30]. Furthermore, since biomolecular binding interactions involve at least partial transfer of a molecular ligand from solution into a binding site, our ability to accurately model solvation and desolvation is thought to provide insight into the level of accuracy we could expect under ideal circumstances in a binding free energy calculation. That is, we should not expect substantially higher accuracy in binding calculations than we can achieve when computing hydration free energies. At a more practical level, these calculations are interesting in part simply because they can be calculated extremely precisely from molecular simulations for many small molecules [32, 49], enabling quantitative comparison to experiment. This comparison can provide insight into where and how to improve our underlying solvation models and force fields [11, 21, 23, 24, 32, 35, 37, 50–52].

For these reasons, the Mobley lab has spent a good deal of effort on hydration free energy calculations. Our approach to calculating these typically involves alchemical free energy calculations based on classical molecular dynamics (MD) simulations [5, 7, 29, 48], usually with a fixed-charge force field in explicit solvent. While other methods such as implicit-solvent calculations [33, 40, 45, 50] and MD simulations based on polarizable force fields [42, 43] or QM-MM approaches [60] are also of considerable interest, these have not been a major emphasis of our work.

Because of our interest in all-atom MD simulations, we previously compiled a database of roughly 504 neutral small molecules with experimental hydration free energies, and we computed hydration free energies of all of these compounds in both implicit solvent [33] and explicit solvent [32] using the GAFF small molecule force field [58, 59], AM1-BCC partial charges [17, 18], and the AMBER (implicit solvent case) [3] and GROMACS (explicit solvent case) [54] simulation packages. This dataset, typically called the “504 molecule set” or the “Mobley set”, has seen substantial use as a benchmark and test set in a reasonably wide variety of applications. We attribute this use partly to the substantial size of the set, and partly to the fact that it includes both experimental and calculated values for all of the compounds, as well as input files. For example, it has been used to test and/or train implicit solvent models to reproduce explicit solvent results with the same parameters, as well as for direct comparisons of new or existing force fields against experiment [1, 8, 11, 13, 24, 25, 27, 28, 38, 55, 57].

While this previous set, which we here call “the 2008 set”, has been useful, it has several deficiencies. First, there are several errors in the set itself, in terms of duplicate compounds, incorrect values, and so on. While these issues are being corrected via an erratum, it seems likely that further updates will be needed in the future (especially if new experiments are performed), and there is no obvious mechanism for keeping the database updated when its main repository is the Supporting Information of a particular paper. Second, the format is less than ideal (in that much of the key information is embedded in PDF files within the Supporting Information), making it difficult to deal with in an automated manner. While we have provided this information in alternate formats such as plain text to individual researchers, this is hardly an ideal solution. Third, we now have additional experimental and calculated values and we would like to extend the set to include these. Fourth, an ideal database would also include additional information to improve ease of use, such as additional compound identifiers like SMILES strings or identifiers from other databases such as PubChem, and better handling of experimental sources. Finally, an ideal database should be extensible in a straightforward manner.

To improve on the current situation, we have moved our database online to a permanent, cite-able URL (http://www.escholarship.org/uc/item/6sd403pz) and simultaneously updated, expanded, and curated the set, also adding additional, smaller sets we have studied previously and since. This paper reports on the update and curation process. The final product includes a variety of changes described below, to deal with limitations of the previous database. Additionally, the database is now versioned. While one specific version of the database is deposited in the Supporting Information associated with this paper, the full database now has a permanent, cite-able repository online which will allow further updates. Here, we describe our curation and construction process for this database, which we call the “Free Solvation Database” or FreeSolv.

Database construction

Starting points

The starting point in constructing the FreeSolv database was to pull together all of the lead author’s previous work calculating hydration free energies in explicit solvent. This included calculated values, experimental values, and structures and input files from several previous studies [22, 31–36, 39]. To simplify the following discussion, we will refer to the set represented in each study by one of the authors’ names, except for the large 2008 set [32, 33] as noted above. Specifically, we drew on the Dumont set [34], the Nicholls set [39], the 2008 set [32, 33], the Mobley set [31], the Klimovich set [22], the Liu set [35], and the Wymer set [36].

For all of these sets except one, we had retained not only calculated and experimental hydration free energies and original coordinate files (.mol2 format) containing geometries and partial charges, but also input files in the form of GROMACS topology and coordinate files. However, for the Nicholls set, we no longer had topology and coordinate files, so these were re-generated using Antechamber and ACPYPE [53].

After pulling together all these files, we found we had source files for 736 compounds. However, no cross-checking had been done at this point to ensure uniqueness of compounds. Uniqueness will be addressed below.

It is worth highlighting that this database contains only neutral solutes. This is driven by two main considerations. First, a variety of technical issues make alchemical free energy calculations for charged solutes extremely challenging [19, 20, 46] and we have only recently begun to understand the necessary corrections. Second, direct experimental measurement of ionic hydration free energies is typically not possible, so values must instead be obtained by decomposing the solvation of ion pairs into the solvation of the individual ions. This step can involve assumptions which are controversial. Hence, here, our focus has been on hydration free energies of neutral compounds. It is worth noting, however, that the Rizzo lab database [45] (http://ringo.ams.sunysb.edu/index.php/Rizzo_Lab_Downloads) contains in excess of 50 ions, including monoatomic and polyatomic ions, so the interested reader is referred there.

Error correction

We were already aware of several errors which we corrected in construction of the FreeSolv set. These will also be addressed in errata to the relevant individual studies. Specifically:

  • A human error had resulted in an incorrect structure and name (triacetyl glycerol) of the molecule which was intended to be triacetin/glycerol triacetate, in the 2008 set [33]. This compound had originated from the Nicholls set [39], where it was correct. The incorrect structure/name is now removed but the correct molecule from the Nicholls set is retained.

  • The experimental value for hexafluoropropene was corrected from −3.76 to 2.31 kcal/mol; it had incorrectly been assigned the value for hexafluoro-propan-2-ol due to human error interpreting abbreviations in reference [45], as per personal communication [44].

  • Several duplicates within the 2008 set [33] were removed, including 2-methylbut-2-ene under slight variants of the same name, 3-methylbut-1-ene in similar circumstances, and benzonitrile which is equivalent to cyanobenzene.

  • From the 2008 set [33], we removed a duplicate butanal entry which had an incorrect experimental value.

  • The molecule labeled pentan-2-one in the Dumont set [34] was actually pentan-3-one, so the name and experimental value were updated to reflect the correct compound.

  • The molecules labeled “lindane” and “prometryn” from the Mobley set were removed because of incorrect stereochemistry in the former case, and a swap between a dimethyl and an ethyl in the latter case. This issue appears to have originated in conversion of .xyz format files to 3D structures when the organizers were preparing for the Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL) challenge [14], and will likely require errata to several papers utilizing the relevant set [14]. This was caught during the curation process discussed below.

Initial construction process

While ideally each compound might be identified by its IUPAC name or SMILES string, different schemes for constructing these can lead to different names or strings. Every compound in the set needs a unique identifier, however, so our first step in updating the set was to assign each compound a compound identifier, consisting of the prefix “mobley_” followed by a unique random integer between 0 and 1 billion. These compound IDs serve as the basic identifiers of compounds in the set, and also serve as file names for structures and molecule files. These IDs were assigned automatically via Python script.
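The ID assignment described above can be sketched in a few lines of Python. This is a minimal illustration rather than the script actually used: the function name, the `seed` argument, and the retry-on-collision strategy are assumptions, and only the “mobley_” prefix and the 0 to 1 billion range come from the text.

```python
import random


def assign_compound_ids(names, prefix="mobley_", upper=1_000_000_000, seed=None):
    """Map each compound name to a unique ID of the form prefix + random integer.

    Hypothetical helper: the retry loop guarantees uniqueness within one run.
    """
    rng = random.Random(seed)
    used = set()
    ids = {}
    for name in names:
        number = rng.randrange(upper)  # 0 <= number < upper
        while number in used:          # redraw on the (rare) collision
            number = rng.randrange(upper)
        used.add(number)
        ids[name] = f"{prefix}{number}"
    return ids


if __name__ == "__main__":
    print(assign_compound_ids(["methane", "decane", "nerol"], seed=1))
```

Random rather than sequential integers keep the identifiers free of any implied ordering or grouping of compounds.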

Once compound identifiers were assigned, we pulled experimental and calculated values, as well as their uncertainties (when applicable; experimental uncertainties were not always available) and names (some followed IUPAC conventions; others did not) from the sets studied previously via custom Python scripts, with one script handling each prior database separately (since data formats differed). The resulting data were stored in a Python dictionary, keyed by compound ID, along with separate digital object identifiers (DOIs) for the sources of the experimental and calculated values. Our Python scripts also organized the supporting files (3D structures and parameter files), ensuring we had .mol2 files with both SYBYL and GAFF atom naming conventions for each molecule, and organizing the appropriate GROMACS topology and coordinate files. As noted above, in the case of the Nicholls set [39], the relevant script also re-generated topology files. A note of this was added to the ‘notes’ field in the database for each of the affected compounds.

Curation process

Following initial construction of the database, we used a Python script drawing on OpenEye software’s Python toolkits [41] to curate the database.

Before doing anything else, this script removed the entry corresponding to 4-nitroaniline from the 2008 set [33], since the Mobley set [31] had this as well with an experimental value which had been more carefully curated [14].

After this, we used OpenEye tools to attempt to parse all of the compound names. Any names which did not parse correctly at this stage were flagged for attention, and these were typically dealt with in one of two ways. First, some of the failures were because stereochemistry information was unspecified by the compound name, but specified in our existing 3D structures. In these cases (1,2-dichloroethylene, nerol) we re-generated IUPAC names from the 3D structure using OpenEye tools. Second, the remaining cases were dealt with manually. There seemed to be several major sources of problems. There were a handful of typos (5-flurouracil rather than 5-fluorouracil, for example), and a variety of other cases where a common name had been used for the compound which was not recognized by the OpenEye toolkits (carbaryl, trifluralin, pirimor, etc.). The Mobley set [14, 31] was the origin of many of these. These were typically resolved by finding alternate names. Our default procedure was to generate the compound from its common name in MarvinSketch [4], and then compute an IUPAC name within MarvinSketch and check if the OpenEye toolkit could parse it back into the correct structure. When this procedure failed, we resorted to searching Wikipedia or PubChem for alternate compound names and checking that we obtained one which the OpenEye toolkits could parse back into the correct structure. In any case where the IUPAC name was edited as described here, a note to this effect was added in the ‘notes’ field of the database. All compound names were stored to the ‘iupac’ field in the database, though not all of these are technically IUPAC names. Additionally, alternate IUPAC names were assigned manually in two additional cases when PubChem lookup (discussed in Section 2.5, below) by the name failed. Specifically, mobley_2636578, 1,3-bis-(nitrooxy)propane, was renamed as 3-nitrooxypropyl nitrate, and mobley_819018, trans-3,7-dimethylocta-2,6-dien-1-ol, was renamed as (2E)-3,7-dimethylocta-2,6-dien-1-ol.

Following this check of compound names, we then generated canonical isomeric SMILES strings for each compound from the 3D structure and stored this to the database. We also then generated an analogous SMILES string for each compound from its stored name. In any case where SMILES generation from the name failed, a new name was generated from the 3D structure and stored, with the ‘notes’ field updated accordingly. In cases where SMILES were generated from both the name and the 3D structure (the vast majority of cases), we cross-checked these and ensured that they matched. This was the step where we caught the errors relating to lindane and prometryn noted above. Aside from that, no errors were found at this step.

Since for the vast majority of compounds we now had two isomeric SMILES strings (one generated from the name, and one from the 3D structure), this provided an ideal opportunity to check for redundancy in the set. Many compounds at this point appeared multiple times. For example, almost all of the compounds from the Dumont set [34] also appeared in the 2008 set [32, 33]. Some of the compounds from the 2008 set appeared in later sets as well. Thus, our next step was to remove duplicate compounds. This was made slightly more difficult by the fact that in some cases, the experimental data had a different origin (typically because an alternate name for the compound had led us to overlook the duplication initially), and thus the experimental values were potentially different. We dealt with this by identifying compounds which were identical (i.e. their canonical isomeric SMILES strings or chemical names were equivalent) and cross-checking their experimental values. In any case where the difference in experimental values was larger than the tabulated experimental uncertainty, the case was flagged for further investigation. This was not true for any of the compounds in the set except 4-nitroaniline, which occurred in both the 2008 and Mobley sets [14, 31]. After investigation, it was concluded that the later value is probably superior and this was retained. The remaining duplicates (approximately 72), where differences were not statistically significant, were removed from the set automatically.
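The duplicate check described above can be sketched as follows. This is an illustration under stated assumptions, not the script actually used: the field names (`smiles`, `expt`, `d_expt`), the compound IDs, and all numerical values are hypothetical; matching is done on SMILES only (not on names); and the tolerance is taken as the largest tabulated uncertainty within each group.

```python
from collections import defaultdict


def find_duplicates(db):
    """Group entries by SMILES and split duplicate groups into two lists.

    Returns (removable, flagged): 'removable' holds the redundant IDs of groups
    whose experimental values agree within uncertainty; 'flagged' holds whole
    groups whose values disagree and need manual investigation.
    """
    by_smiles = defaultdict(list)
    for cid, entry in db.items():
        by_smiles[entry["smiles"]].append(cid)

    removable, flagged = [], []
    for cids in by_smiles.values():
        if len(cids) < 2:
            continue  # unique compound, nothing to do
        values = [db[c]["expt"] for c in cids]
        tolerance = max(db[c]["d_expt"] for c in cids)  # assumed criterion
        if max(values) - min(values) > tolerance:
            flagged.append(cids)
        else:
            removable.append(cids[1:])  # keep the first entry, drop the rest
    return removable, flagged


# Made-up entries for illustration only (kcal/mol).
example_db = {
    "x1": {"smiles": "C", "expt": 2.0, "d_expt": 0.2},
    "x2": {"smiles": "CCO", "expt": -5.0, "d_expt": 0.6},
    "x3": {"smiles": "CCO", "expt": -5.1, "d_expt": 0.6},
    "x4": {"smiles": "CC=O", "expt": -3.5, "d_expt": 0.6},
    "x5": {"smiles": "CC=O", "expt": -1.0, "d_expt": 0.6},
}

if __name__ == "__main__":
    print(find_duplicates(example_db))
```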

In separate work, J. Peter Guthrie is compiling an extensive, carefully curated database of experimental hydration free energies. We cross-compared experimental values in our set to a pre-release version of the Guthrie database, and flagged discrepancies above 1 kcal/mol. (Discrepancies below 1 kcal/mol numbered over 100, and fall within the scope of Guthrie’s database curation work rather than the scope of this paper.) In the flagged cases we obtained details of the data from Guthrie and in some cases updated experimental values and references. When we did so, this is recorded in the ‘notes’ field of the database. This was true for 4-propylphenol, 4-bromophenol, 3-hydroxybenzaldehyde, 2-methoxyethanol, (2E)-hex-2-enal, and dimethyl sulfoxide/methylsulfinylmethane.

Additionally, after consultation with Guthrie, we removed a series of sulfonylurea compounds from the Mobley set [14, 31], because of concerns about the quality of the underlying vapor pressure measurements, especially Figs. 2, 3, 4, 5 of reference [47]. Specifically, we removed the compounds called sulfometuron-methyl, metsulfuronmethyl, chlorimuronethyl, thifensulfuron, and bensulfuron. Unfortunately this means that we now only have two sulfones in our set, and in general have far too few sulfur-containing compounds, as we discuss below.

We also updated the experimental details for 1,3-butadiene. Specifically, we updated the reference to point to the original experimental data of Hine and Mookerjee [16], and updated our previous hydration free energy from 0.6 to 0.65 kcal/mol. As pointed out by Christopher I. Bayly in personal correspondence, the raw data there for activity coefficients in gas and water (\(-\log c_g = 1.39\) and \(-\log c_w = 1.87\)) lead to a difference of −0.48 rather than the stated value of −0.41, which is apparently a typo. The former leads to a hydration free energy of 0.65 kcal/mol, the correct value, while the latter would yield 0.56 kcal/mol.
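The arithmetic behind these two values can be checked with a short script. It assumes the standard relation \(\Delta G = -RT \ln(10)\,\log(c_w/c_g)\) and a temperature of 298.15 K; the temperature and the function name are assumptions made for this sketch, not details taken from the source data.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def free_energy_from_log_ratio(log_ratio, temperature=298.15):
    """Hydration free energy (kcal/mol) from log10(c_w / c_g).

    The default temperature of 298.15 K is an assumption of this sketch.
    """
    return -R_KCAL * temperature * math.log(10) * log_ratio


if __name__ == "__main__":
    # log10(c_w / c_g) = (-log c_g) - (-log c_w) = 1.39 - 1.87 = -0.48
    corrected = free_energy_from_log_ratio(1.39 - 1.87)
    as_printed = free_energy_from_log_ratio(-0.41)
    print(f"{corrected:.2f}")   # 0.65
    print(f"{as_printed:.2f}")  # 0.56
```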

As a final step, we also generated SDF format files for all of the molecules in the set using the OpenEye toolkits. These supplement the .mol2 files we already had available.

Any further curation done will be documented in the database documentation distributed with each database version.

Annotation

In the past, we have found it useful to focus analysis on just a fraction of the database, such as by examining systematic errors organized by functional group [32]. To aid further such analysis, we used Checkmol [15] to assign functional groups to all of the compounds in the set. The resulting functional group identifiers were stored to the database in the ‘groups’ field.

We also decided to link compounds in our set to alternate databases to simplify future work relating to compound identification, so we chose PubChem compound identifiers as an alternate way of referencing compounds. We assigned PubChem compound IDs to all of the compounds in our set automatically using PubChemPy [56]. Our script first attempted lookup by the assigned compound name (usually the IUPAC name) and, in cases where this did not result in a match in PubChem, fell back to lookup via SMILES string. In several cases, typically due to unspecified stereochemistry in PubChem, we had to assign a PubChem ID manually. This was the case for mobley_6843802, [(1R)-1,2,2-trifluoroethoxy]benzene; mobley_7869158, [(2S)-butan-2-yl] nitrate; and mobley_9741965, 1,3-bis-(nitrooxy)butane. PubChem IDs are thus stored in the database for all compounds in the set.
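The name-then-SMILES fallback can be sketched as below. To keep the example runnable without network access, the lookup is passed in as a function; in practice `pubchempy.get_cids(identifier, namespace)` could be supplied for it. The helper name, the choice of taking the first returned CID, and the stand-in lookup table (including its CID values) are assumptions of this sketch.

```python
def assign_pubchem_cid(name, smiles, lookup):
    """Return the first CID found by name, else by SMILES, else None.

    `lookup(identifier, namespace)` must return a list of CIDs; for example,
    pubchempy.get_cids could be passed here. Returning None marks the compound
    for manual assignment.
    """
    for identifier, namespace in ((name, "name"), (smiles, "smiles")):
        cids = lookup(identifier, namespace)
        if cids:
            return cids[0]
    return None


def fake_lookup(identifier, namespace):
    """Offline stand-in for a PubChem query; the CIDs here are placeholders."""
    table = {("methane", "name"): [1], ("CCO", "smiles"): [2]}
    return table.get((identifier, namespace), [])


if __name__ == "__main__":
    print(assign_pubchem_cid("methane", "C", fake_lookup))            # found by name
    print(assign_pubchem_cid("unparsed name", "CCO", fake_lookup))    # SMILES fallback
    print(assign_pubchem_cid("unparsed name", "C#N", fake_lookup))    # manual case
```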

Database format

Currently, the database is stored within Python as a dictionary, keyed by compound ID, with each compound having keys for the various entries (SMILES string, experimental value and uncertainty, calculated value and uncertainty, (IUPAC) name, functional groups, PubChem ID, and notes). This database is then stored as a Python pickle file, and in a semicolon delimited text file. In the latter format, functional groups are stored to a separate file, groups.txt, to ensure the number of fields in the database text file is manageable. The semicolon delimited format was chosen because other common delimiters (spaces, commas) often occur in compound names making them unsuitable as delimiters.
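A minimal sketch of this layout is given below: a dictionary keyed by compound ID, written to a semicolon-delimited text file and to a pickle. The field names, their order, the compound ID, and the numerical values are placeholders chosen for illustration and are not the actual FreeSolv file specification.

```python
import csv
import io
import pickle

# Hypothetical field order for the text file.
FIELDS = ["smiles", "iupac", "expt", "d_expt", "calc", "d_calc", "pubchem", "notes"]


def write_text(db, handle):
    """Write one semicolon-delimited line per compound."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    for cid, entry in db.items():
        writer.writerow([cid] + [entry[field] for field in FIELDS])


def read_text(handle):
    """Read the text format back; all values come back as strings."""
    db = {}
    for row in csv.reader(handle, delimiter=";"):
        db[row[0]] = dict(zip(FIELDS, row[1:]))
    return db


# Placeholder entry: the ID and all numbers are made up for illustration.
example_db = {
    "mobley_0000001": {
        "smiles": "CC(Cl)(Cl)Cl",
        "iupac": "1,1,1-trichloroethane",  # commas in names are why ';' is used
        "expt": -0.2, "d_expt": 0.6, "calc": 0.1, "d_calc": 0.02,
        "pubchem": 0, "notes": "placeholder values",
    }
}

if __name__ == "__main__":
    buffer = io.StringIO()
    write_text(example_db, buffer)
    print(buffer.getvalue(), end="")
    restored = pickle.loads(pickle.dumps(example_db))
    print(restored == example_db)
```

The pickle preserves Python types exactly, while the text file is readable from other languages at the cost of converting numbers back from strings.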

Database contents

Currently, the database contains 643 neutral compounds which can mostly be considered fragment-like from a drug discovery perspective. These range in molecular weight from methane (16.04 Daltons, compound mobley_9055303) to 1,2,3,4,5-pentachloro-6-(2,3,4,5,6-pentachlorophenyl)benzene (that is, decachlorobiphenyl, at 498.66 Daltons, compound mobley_5456566) (Fig. 1). The compounds also span a range of polarities. While experimental dipole moments are not part of our data set, we can compute dipole moments based on the AM1-BCC partial charges assigned to molecules, and we find that dipole moments range from 0.0 (methane and many others) to 7.14 for 4-nitroaniline (mobley_6082662). Experimental hydration free energies cover a range of approximately 29 kcal/mol, from 3.43 kcal/mol for octafluorocyclobutane (mobley_1723043) to −25.47 kcal/mol for (2R,3R,4S,5S,6R)-6-(hydroxymethyl)tetrahydropyran-2,3,4,5-tetrol (mobley_9534740). Calculated hydration free energies range from 3.43 kcal/mol for decane (mobley_2197088) to −21.71 kcal/mol for cyanuric acid (mobley_6239320). The distribution of these properties is shown in Fig. 2.

Fig. 1

Shown are compounds representing some of the extrema in the set. 1,2,3,4,5-pentachloro-6-(2,3,4,5,6-pentachlorophenyl)benzene (mobley_5456566) has the largest molecular weight, while methane has the smallest. Methane, among others, has the smallest dipole moment, while 4-nitroaniline (mobley_6082662) has the largest. Experimental hydration free energies range from 3.43 kcal/mol for octafluorocyclobutane (mobley_1723043) to −25.47 kcal/mol for (2R,3R,4S,5S,6R)-6-(hydroxymethyl)tetrahydropyran-2,3,4,5-tetrol (mobley_9534740), while calculated values range from 3.43 kcal/mol for decane (mobley_2197088) to −21.71 kcal/mol for cyanuric acid (mobley_6239320). a mobley_5456566. b mobley_6082662. c mobley_1723043. d mobley_9534740. e mobley_2197088. f mobley_6239320.

Fig. 2

Distributions of molecular weight, dipole moment, and hydration free energies for the set described here. a Molecular weight distribution. b Dipole moment distribution. c Experimental hydration free energy. d Calculated hydration free energy

While calculated and experimental hydration free energies for the compounds in this set have been compared before, this analysis is spread across several studies and aggregate statistics are not available. Figure 3 compares calculated and experimental values for the set. Here, we find an overall average error of 0.47 ± 0.06 kcal/mol, an RMS error of 1.51 ± 0.07 kcal/mol, an average unsigned error of kcal/mol, a Kendall τ of 0.80 ± 0.01, and a Pearson R of 0.94 ± 0.01.
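For readers wishing to recompute aggregate statistics of this kind from the database, a dependency-free sketch follows. The sign convention (calculated minus experimental) and the use of the simple Kendall τ-a (which ignores ties) are assumptions, since the text does not specify them, and the three data points are made up purely to exercise the function.

```python
import math


def summary_statistics(calc, expt):
    """Return mean error, RMS error, average unsigned error, Pearson R, Kendall tau."""
    n = len(calc)
    errors = [c - e for c, e in zip(calc, expt)]  # assumed convention: calc - expt
    mean_error = sum(errors) / n
    rms_error = math.sqrt(sum(d * d for d in errors) / n)
    unsigned_error = sum(abs(d) for d in errors) / n

    mean_c, mean_e = sum(calc) / n, sum(expt) / n
    cov = sum((c - mean_c) * (e - mean_e) for c, e in zip(calc, expt))
    var_c = sum((c - mean_c) ** 2 for c in calc)
    var_e = sum((e - mean_e) ** 2 for e in expt)
    pearson_r = cov / math.sqrt(var_c * var_e)

    # Kendall tau-a: (concordant - discordant) pairs over all pairs.
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (calc[i] - calc[j]) * (expt[i] - expt[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    kendall_tau = (concordant - discordant) / (n * (n - 1) / 2)

    return {"mean": mean_error, "rms": rms_error, "aue": unsigned_error,
            "pearson_r": pearson_r, "kendall_tau": kendall_tau}


if __name__ == "__main__":
    # Made-up values in kcal/mol, for illustration only.
    print(summary_statistics([1.0, -2.0, -5.0], [0.5, -2.5, -4.0]))
```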

Fig. 3

Calculated versus experimental hydration free energies for the compounds in the set. Error bars are present for both calculated and experimental values, but statistical uncertainties in the calculated values are extremely small, which typically makes it difficult to see the error bars

As noted previously [25, 32], having such a large set of data makes it possible to look for systematic errors in the force field description of particular functional groups. This can also be seen in Fig. 4, where we look at the average unsigned error by functional group (as assigned by Checkmol). Previously, we have used information from similar tests to isolate systematic errors for alkynes [32] and alcohols [12] and taken some steps towards addressing these issues. However, further work in this direction is needed, as it seems fairly clear that some functional groups tend to have particularly large errors.

Fig. 4

Average unsigned error by functional group. Shown is the average unsigned error for compounds in the set by functional group (as assigned by Checkmol) for functional groups represented in at least five compounds in the set. Alcohols tend to be particularly problematic, as we are addressing elsewhere [12], but a variety of other functional groups appear particularly challenging as well. Error bars were computed via 10,000 iterations of a bootstrapping procedure described elsewhere [36], where we construct new data sets with replacement while resampling the experimental data with Gaussian noise and look at the SD over trials
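One possible reading of the bootstrapping procedure summarized in this caption is sketched below. The full procedure is described in reference [36], so this is an interpretation rather than a reproduction: compounds are resampled with replacement, each resampled experimental value is perturbed with Gaussian noise whose width is its tabulated uncertainty, and the sample standard deviation of the average unsigned error over trials is reported. The function name, the `seed` argument, and the example numbers are hypothetical.

```python
import random
import statistics


def bootstrap_aue_sd(calc, expt, d_expt, n_trials=10000, seed=0):
    """Estimate the uncertainty in the average unsigned error by bootstrapping."""
    rng = random.Random(seed)
    n = len(calc)
    trial_values = []
    for _ in range(n_trials):
        total = 0.0
        for _ in range(n):
            i = rng.randrange(n)                      # resample with replacement
            noisy_expt = rng.gauss(expt[i], d_expt[i])  # perturb the experiment
            total += abs(calc[i] - noisy_expt)
        trial_values.append(total / n)
    return statistics.stdev(trial_values)  # SD over trials


if __name__ == "__main__":
    # Made-up values in kcal/mol, for illustration only.
    calc = [1.0, -2.0, -5.0, -7.5]
    expt = [0.5, -2.5, -4.0, -9.0]
    d_expt = [0.6, 0.6, 0.2, 0.6]
    print(bootstrap_aue_sd(calc, expt, d_expt, n_trials=2000))
```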

One reason hydration free energies are of such interest is that they provide a test of potential relevance to binding affinity calculations for drug discovery. But is this set relevant to drug discovery? The typical size of molecules in the set is substantially smaller than typical small-molecule drugs. As noted, many of these molecules are more like “fragments” than drugs. But this may not be a problem as long as we cover all the common chemical functionalities found in drug molecules. For example, if we know that each hydroxyl group typically leads to a systematic error of just over 1 kcal/mol in fragment-like molecules [12], there is no reason to assume the error should be more or less in larger, drug-like molecules. But if there are some functional groups which frequently occur in drug-like molecules but are missing from the present set, then we have very little insight into what level of performance to expect on compounds containing these functional groups.

To compare functional group representation in typical drugs with that in our set, we downloaded the set of small molecule drugs from DrugBank 3.0 [26]. This contains over 1,500 approved small-molecule drugs and a larger number of experimental drugs, with some 6,583 molecules in total. We then compared the functional group distribution seen in these molecules with that represented in our set (Fig. 5). On the whole, results are mixed. The present set does cover a reasonably broad range of functional groups, and even has more of some functional groups than in typical drugs (chlorinated compounds are a good example of this). But some functional groups are underrepresented by far or do not appear at all, such as aminals/hemiaminals, boronic acid and boronic acid esters, enamines, enols, enol ethers, hemithioaminals, and many sulfur-containing compounds, especially sulfonamides, sulfonic acids, sulfuric acid monoesters, and thiocarboxylic acid esters. If we want to truly understand how well our methods can predict thermodynamic properties for molecules containing these functional groups, we will need more data. These classes of compounds are also particularly concerning in that they are further away from the region of chemical space we have studied the most: current biomolecular force fields have typically started with proteins and sometimes nucleic acids and branched out from there. As we move further from that region of chemical space, we know less about how well we can expect our force fields to work. Thus we particularly need more data for these types of compounds.

Fig. 5

Distribution of functional groups in DrugBank versus our dataset. At top is the distribution of functional groups (assigned by Checkmol) in DrugBank, and at bottom, the distribution of functional groups in our small-molecule hydration set. Functional groups with fewer than 30 occurrences in DrugBank are excluded for space reasons, and a variety of other functional groups have been merged or skipped as described in the text, again for space reasons. The abbreviation “ca” is short for carboxylic acid.

Conclusions

Here, we provide FreeSolv, an updated database of calculated and experimental hydration free energies for a large set of 643 neutral molecules which are mostly fragment-like. This database is freely available at http://www.escholarship.org/uc/item/6sd403pz and updates will be posted there when available.

While this database builds on our previously published work, it corrects a number of errors and redundancies and is more carefully curated. It is also designed to allow easy automated use via programs and scripts, and contains a variety of supporting files including molecular structures, topology and coordinate files, parameter files, and so on. We also provide SMILES strings and PubChem compound IDs for all the compounds in the set to allow easier cross-linking to other sources of chemical information.

We hope that the availability of the FreeSolv dataset will drive future force field development, development and testing of new methods, and potentially even new experimental work to fill in gaps in the available data. For example, we have highlighted functional groups which are common in drugs, and which are underrepresented or not present in this set.

Supporting Information

In the Supporting Information, we provide version 0.3 of the FreeSolv database, released Feb. 3, 2014, and a PDF file detailing changes leading up to this database.