1 Introduction

Due to the increasing computational power provided by supercomputers and recent advances in the development of economical ab initio methods (e.g., advances in explicitly correlated techniques [1–4]), high-level ab initio methods have now been refined to the point where they are applicable to biologically relevant systems (see Refs. [5–13] for some recent studies). Proteinogenic amino acids are the most basic building blocks of proteins and play key roles in protein structure and function. They also serve as precursors of many biologically relevant molecules, such as polypeptides, nucleotides, hormones, neurotransmitters, and antioxidants [14]. Despite their importance, the experimental gas-phase heats of formation of most of the natural amino acids are not accurately known. Determination of these fundamental thermochemical quantities may be important in understanding why nature chose these molecules as fundamental biological building blocks, for example, by comparing the relative stabilities of \(\alpha\)- versus \(\beta\)-amino acids [15, 16]. Accurate heats of formation for the amino acids are also important from a theoretical point of view, e.g., for the validation and parameterization of computationally cost-effective procedures such as density functional theory, semiempirical molecular orbital theory, and molecular mechanics. In recent years, a large number of theoretical studies were dedicated to obtaining thermochemical properties of amino acids using high-level theoretical procedures [15–25].

In the present work, we obtain accurate theoretical heats of formation for the lowest-energy conformers of 18 proteinogenic amino acids using the high-level, ab initio W1-F12 and W2-F12 thermochemical procedures [31]. These thermochemical procedures represent layered extrapolations to the all-electron, relativistic CCSD(T)/CBS energy (complete basis set limit coupled cluster with singles, doubles, and quasiperturbative triple excitations) and can achieve an accuracy in the sub-kcal/mol range for molecules whose wave functions are dominated by dynamical correlation [31, 32]. We use these benchmark values to evaluate the performance of a variety of G\(n\)-type procedures [33] that were recently used for obtaining accurate thermochemical properties of amino acids [15–20].

The present paper also seeks to pay tribute to the scientific achievements of Prof. Isaiah Shavitt (OBM) and specifically to his seminal contributions to coupled cluster theory [26], to the theory and development of Gaussian basis sets [27], to accurate applied quantum chemistry [28–30], and to computational biochemistry [5].

2 Computational details

Most calculations were run on the CRUNTCh (Computational Research at UNT in Chemistry) Linux farm at the University of North Texas, on the high-performance computing National Computational Infrastructure (NCI) National Facility at Canberra, and on the iVEC@UWA facilities. Some additional calculations were carried out on the Faculty of Chemistry Linux farm at the Weizmann Institute of Science.

The geometries have been optimized at the B3LYP/A’VTZ level of theory [34–36] (where A’VTZ indicates the combination of the standard correlation-consistent cc-pVTZ basis set on hydrogen, [37] the aug-cc-pVTZ basis set on first-row elements, [38] and the aug-cc-pV(T+d)Z basis set on sulfur) [39]. All geometry optimizations and frequency calculations were performed using the Gaussian 09 program suite [40]. Benchmark relativistic, all-electron CCSD(T)/CBS energies were then obtained by means of our recently developed W1-F12 and W2-F12 thermochemical protocols [31] using the Molpro 2012.1 program suite [41]. The computational protocols of W1-F12 and W2-F12 theories have been specified and rationalized in reference [31].

In W1-F12 theory, the Hartree–Fock component is extrapolated from the VDZ-F12 and VTZ-F12 basis sets, using the \(E(L) = E_{\infty } + \hbox {A}/L^{\alpha }\) two-point extrapolation formula with \(\alpha\) = 5 (where \(L\) is the highest angular momentum represented in the basis set, and V\(n\)Z-F12 denotes the cc-pV\(n\)Z-F12 basis sets of Peterson et al. [42] which were developed for explicitly correlated calculations). Optimal values for the geminal Slater exponents (\(\beta\)) used in conjunction with the V\(n\)Z-F12 basis sets were taken from reference [43]. The valence CCSD-F12 correlation energy is extrapolated from the same basis sets, using the said two-point extrapolation formula. Extrapolation exponents (\(\alpha\)) were taken from references [31, 43]. In all of the explicitly correlated coupled cluster calculations the diagonal, fixed-amplitude 3C(FIX) ansatz [45–47] and the CCSD-F12b approximation [48, 49] are employed. The (T) valence correlation energy is obtained in the same way as in the original Weizmann-1 (W1) theory, [50] i.e., extrapolated from the A’VDZ and A’VTZ basis sets using the above two-point extrapolation formula with \(\alpha\) = 3.22. The CCSD inner-shell contribution is calculated with the core-valence weighted correlation-consistent A’PWCVTZ basis set of Peterson and Dunning, [51] while the (T) inner-shell contribution is calculated with the PWCVTZ(no \(f\)) basis set (where A’PWCVTZ indicates the combination of the cc-pVTZ basis set on hydrogen and the aug-cc-pwCVTZ basis set on carbon, and PWCVTZ(no \(f\)) indicates the cc-pwCVTZ basis set without the \(f\) functions). The scalar relativistic contribution (in the second-order Douglas–Kroll–Hess approximation [52, 53]) is obtained as the difference between non-relativistic CCSD(T)/A’VDZ and relativistic CCSD(T)/A’VDZ-DK calculations [54] (where A’VDZ-DK indicates the combination of the cc-pVDZ-DK basis set on H and aug-cc-pV(D+d)Z-DK basis set on heavier elements). The atomic spin–orbit coupling terms are taken from the experimental fine structure, and the diagonal Born–Oppenheimer correction (DBOC) is calculated at the HF/A’VTZ level of theory. The zero-point vibrational energies (ZPVEs) are derived from B2PLYP/def2-TZVPP harmonic frequencies (and scaled by 0.9833, see Sect. 3.3).
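As a minimal sketch of the two-point extrapolation described above, the snippet below solves \(E(L) = E_{\infty } + A/L^{\alpha }\) for \(E_{\infty }\) from energies at two basis set levels. The function name, the synthetic energies, and the assignment of \(L\) = 2 and 3 to a double-zeta/triple-zeta pair are illustrative assumptions, not code or data from the present work.

```python
def extrapolate_two_point(e_small, e_large, l_small, l_large, alpha):
    """Solve E(L) = E_inf + A / L**alpha for E_inf from two (L, E) pairs."""
    ratio = (l_large / l_small) ** alpha
    return e_large + (e_large - e_small) / (ratio - 1.0)


# Synthetic check (hypothetical numbers): build E(2) and E(3) from a known
# limit and prefactor, then recover the limit with the same exponent.
E_INF_TRUE, A_TRUE, ALPHA = -100.0, 0.5, 5.0
e2 = E_INF_TRUE + A_TRUE / 2**ALPHA
e3 = E_INF_TRUE + A_TRUE / 3**ALPHA
e_inf = extrapolate_two_point(e2, e3, 2, 3, ALPHA)
print(f"recovered limit: {e_inf:.6f}")
```

The same function applies to any of the exponents quoted in the text (e.g., 3.22 for the (T) component); only `alpha` and the two input energies change.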

In W2-F12, the Hartree–Fock component is calculated with the VQZ-F12 basis set. The valence CCSD-F12 correlation energy is extrapolated from the VTZ-F12 and VQZ-F12 basis sets, using the above two-point extrapolation formula with \(\alpha\) = 5.94. The quasiperturbative triples, (T), corrections are obtained from standard CCSD(T)/VTZ-F12 calculations (i.e., without inclusion of F12 terms) and scaled by the factor f = \(0.987 \times E_{\text {MP2-F12}}/E_{\text {MP2}}\). This approach has been shown to accelerate the basis set convergence [31, 49]. The CCSD inner-shell contribution is calculated with the core-valence weighted correlation-consistent A’PWCVTZ basis set, while the (T) inner-shell contribution is calculated with the PWCVTZ(no f) basis set. The scalar relativistic, spin–orbit coupling, DBOC, and ZPVE corrections are obtained in the same way as in W1-F12 theory.
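The scaling of the (T) correction described above amounts to a one-line operation. The sketch below assumes that \(E_{\text {MP2-F12}}\) and \(E_{\text {MP2}}\) are the corresponding correlation energies; the function name and the input numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def scaled_triples(e_t, e_mp2_f12, e_mp2, prefactor=0.987):
    """Scale a conventional (T) energy by f = prefactor * E_MP2-F12 / E_MP2."""
    f = prefactor * e_mp2_f12 / e_mp2
    return f * e_t


# Hypothetical correlation energies in hartree (not values from this work).
e_t_scaled = scaled_triples(e_t=-0.02, e_mp2_f12=-0.55, e_mp2=-0.50)
print(f"scaled (T) energy: {e_t_scaled:.6f} hartree")
```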

The total atomization energies at 0 K (\(\hbox {TAE}_0\)) are converted to heats of formation at 298 K using the Active Thermochemical Tables (ATcT) [55–59] atomic heats of formation at 0 K (H 51.633 \(\pm\) 0.000, C 170.024 \(\pm\) 0.014, N 112.469 \(\pm\) 0.007, O 58.997 \(\pm\) 0.000, and S 65.709 \(\pm\) 0.036  kcal/mol), and the CODATA [60] enthalpy functions, \(\hbox {H}_{298}-\hbox {H}_{0}\), for the elemental reference states (\(\hbox {H}_{2}(\hbox {g}) = 2.024\pm 0.000\), C(cr,graphite) = \(0.251\pm 0.005, \hbox {N}_2(\hbox {g}) = 2.072\pm 0.000, \hbox {O}_2(\hbox {g}) = 2.075\pm 0.000\), and S(cr,rhombic) = \(1.054\pm 0.001\) kcal/mol), while the enthalpy functions for the amino acids are obtained within the rigid rotor harmonic oscillator (RRHO) approximation from B3LYP/A’VTZ geometries and harmonic frequencies.
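The conversion described above can be sketched as follows, using the standard atomization-energy route: \(\Delta _f H^\circ _0\) is the sum of the atomic heats of formation minus \(\hbox {TAE}_0\), and the 298 K value adds the molecular enthalpy function and subtracts those of the elements in their reference states. The atomic and elemental constants are the ones quoted in the text; the function names and the input \(\hbox {TAE}_0\) and molecular enthalpy function are hypothetical placeholders, not results from this work.

```python
# ATcT atomic heats of formation at 0 K (kcal/mol), as quoted in the text.
ATOM_DHF_0K = {"H": 51.633, "C": 170.024, "N": 112.469, "O": 58.997, "S": 65.709}

# CODATA H298 - H0 of the elemental reference states (kcal/mol), put on a
# per-atom basis: the H2, N2, and O2 values are quoted per diatomic molecule.
ELEMENT_HCF = {"H": 2.024 / 2, "C": 0.251, "N": 2.072 / 2, "O": 2.075 / 2, "S": 1.054}


def heat_of_formation_0k(formula, tae0):
    """Delta_f H(0 K) = sum of atomic heats of formation - TAE0."""
    return sum(n * ATOM_DHF_0K[el] for el, n in formula.items()) - tae0


def heat_of_formation_298k(formula, tae0, hcf_molecule):
    """Add the molecular H298 - H0 and subtract that of the elements."""
    element_term = sum(n * ELEMENT_HCF[el] for el, n in formula.items())
    return heat_of_formation_0k(formula, tae0) + hcf_molecule - element_term


# Glycine stoichiometry with hypothetical inputs (illustration only).
glycine = {"C": 2, "H": 5, "N": 1, "O": 2}
dhf0 = heat_of_formation_0k(glycine, tae0=920.0)
dhf298 = heat_of_formation_298k(glycine, tae0=920.0, hcf_molecule=4.0)
print(f"dHf(0 K) = {dhf0:.3f}, dHf(298 K) = {dhf298:.3f} kcal/mol")
```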

W1-F12 shows excellent performance for systems containing first-row elements (and H). Specifically, for the 97 first-row atomization energies in the W4-11 dataset, [32] W1-F12 attains a root mean square deviation (RMSD) of 0.19 kcal/mol relative to all-electron, relativistic CCSD(T) reference atomization energies at the infinite basis set limit. However, for second-row systems, it was found that the performance of W1-F12 is significantly degraded owing to shortcomings of the cc-pVDZ and cc-pVDZ-F12 basis sets for second-row elements (see Ref. [31] for details): for the 40 second-row atomization energies in the W4-11 dataset, the RMSD actually exceeds 1 kcal/mol. W2-F12 does not suffer from this problem and yields similar RMSDs of 0.18 kcal/mol for first-row and 0.24 kcal/mol for second-row systems. (For further details, see reference [31].) Thus, for the sulfur-containing amino acids (cysteine and methionine) and for the small amino acids (alanine, glycine, and serine), the heats of formation are also obtained using W2-F12 theory.

Glycine is small enough (especially considering its \(C_s\) symmetry) that the result can be independently verified using accurate thermochemical procedures based on layered extrapolation of orbital basis sets, specifically the high-accuracy W4 method [68]. Full details of the method are given in that reference and will not be repeated here; suffice it to say that for a set of molecules where accurate experimental atomization energies are available via ATcT, the RMSD from experiment is 0.10 kcal/mol [32, 68]. The largest-scale calculation involved here, CCSD/aug’-cc-pV6Z, entails 1400 basis functions and required 3 terabytes of scratch space, yet ran to completion within a day on a machine with a large solid-state disk array. The CCSD(T)/aug’-cc-pV5Z calculation, involving 910 basis functions, ran in under a day on 32 cores and 512 GB of RAM.

The heats of formation have also been obtained using computationally more economical composite procedures, namely the Gaussian-4 (G4) protocol [33, 63] and its computationally more economical G4(MP2) and G4(MP2)-6X variants [64, 65]. These calculations were performed using the Gaussian 09 program suite [40]. The G4 and G4(MP2) protocols are widely used for the calculation of thermochemical properties and are applicable to relatively large systems (of up to 20–30 non-hydrogen atoms). They generally give RMSDs from experimental or high-accuracy theoretical thermochemical data of 1–2 kcal/mol [32, 63, 64]. For example, for the 454 experimental thermochemical determinations of the G3/05 test set (including heats of formation, ionization energies, and electron affinities), [66] G4 and G4(MP2) attain RMSDs of 1.2 and 1.5 kcal/mol, respectively [63, 64]. For the set of 137 very accurate theoretical atomization energies in the W4-11 set, both procedures attain an RMSD of 2.0 kcal/mol [32]. Finally, we have also considered the performance of the CBS-QB3 procedure [67] using Gaussian 09.

3 Results and discussion

3.1 Computational cost of the W1-F12 calculations

For systems consisting of more than eight non-hydrogen atoms (with \(\hbox {C}_1\) symmetry), W1 theory [50] becomes prohibitively expensive with current commodity server hardware. W1-F12 theory is an explicitly correlated version of the W1 method, [50] which combines explicitly correlated F12 methods [1–4] with extrapolation techniques in order to approximate the CCSD(T)/CBS energy. Because of the drastically accelerated basis set convergence of the F12 methods [42, 43], W1-F12 is superior to the original W1 method, not only in terms of performance but also in terms of computational cost [31]. For example, the CPU times for calculating W1 and W1-F12 energies for a system containing 8 non-hydrogen atoms (with \(\hbox {C}_1\) symmetry) are 595 and 163 h, respectively (both calculations ran on 8 Intel Xeon Sandy Bridge cores at 2.6 GHz). In terms of disk space requirements, the W1 calculation used about five times the amount of scratch disk (660 GB) that the W1-F12 calculation required (126 GB).

In the present work, we obtain W1-F12 energies for the 18 amino acids with up to 12 non-hydrogen atoms. Of these, the largest amino acids are glutamic acid, glutamine, and lysine (10 non-hydrogen atoms); histidine (11 non-hydrogen atoms); arginine and phenylalanine (12 non-hydrogen atoms). Considering the fact that none of the amino acids (apart from glycine) have any spatial symmetry, these represent the largest W1-F12 calculations reported to date. For example, the W1-F12 calculation for arginine ran for 51 days on 6 Intel Nehalem 8837 cores at 2.67 GHz and used 253 GB of RAM and 1.1 TB of scratch disk. Due to this very steep computational cost, we obtain our best heats of formation for the two amino acids with more than 12 non-hydrogen atoms (i.e., tryptophan and tyrosine) with the G\(n\) and CBS-QB3 methods [33, 67], which have a significantly reduced computational cost. In Sect. 3.4, we show that, relative to W1-F12 and W2-F12 heats of formation, G4, G4(MP2)-6X, and CBS-QB3 result in RMSDs of 0.72, 0.71, and 1.01  kcal/mol, respectively, i.e., near or below the threshold of “chemical accuracy” (traditionally arbitrarily defined as 1 kcal/mol).

3.2 W1-F12 and W2-F12 benchmark heats of formation

Since W1-F12 and W2-F12 theories represent layered extrapolations to the CCSD(T) basis set limit energy, it is of interest to estimate whether the contributions from post-CCSD(T) excitations are likely to be significant for the atomization energies of the amino acids. The percentage of the total atomization energy accounted for by parenthetical connected triple excitations, \(\%\hbox {TAE}_e\)[(T)], has been shown to be a reliable energy-based diagnostic for the importance of non-dynamical correlation effects [68, 74]. It has been suggested that \(\%\hbox {TAE}_e\)[(T)] \(<\) 2 % indicates systems that are dominated by dynamical correlation, while 2 % \(<\) \(\%\hbox {TAE}_e\)[(T)] \(<\) 5 % indicates systems that include mild non-dynamical correlation. \(\%\hbox {TAE}_e\)[(T)] values for the amino acids are gathered in Table 1. The amino acids are characterized by \(\%\hbox {TAE}_e\)[(T)] values ranging from 1.7 (leucine) to 2.5 % (histidine). Note also that in all cases, the SCF component accounts for 69–77 % of the total atomization energy. These values suggest that our all-electron, non-relativistic, vibrationless benchmark atomization energies should, in principle, lie within considerably less than 1 kcal/mol of the atomization energies at the full configuration interaction (FCI) basis set limit. For example, for systems that are associated with similar \(\%\hbox {TAE}_e\)[(T)] values in the W4-11 dataset [32], post-CCSD(T) contributions to the atomization energy are 0.2 kcal/mol or less, although somewhat larger values were found for benzene [75, 76].
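The \(\%\hbox {TAE}_e\)[(T)] diagnostic discussed above is simple to evaluate once the (T) contribution and the total atomization energy are known. The sketch below uses the 2 % and 5 % thresholds quoted in the text; the function names, the treatment of values exactly at a threshold, and the numerical inputs are illustrative assumptions.

```python
def pct_tae_triples(tae_t, tae_total):
    """Percentage of the total atomization energy recovered by (T)."""
    return 100.0 * tae_t / tae_total


def classify(pct):
    """Map %TAE[(T)] onto the ranges quoted in the text."""
    if pct < 2.0:
        return "dominated by dynamical correlation"
    if pct < 5.0:
        # A value of exactly 2 % is assigned here by assumption.
        return "mild non-dynamical correlation"
    return "outside the ranges quoted in the text"


# Hypothetical component values in kcal/mol (not taken from Table 1 or 3).
pct = pct_tae_triples(tae_t=20.0, tae_total=1000.0)
print(f"%TAE[(T)] = {pct:.1f} % -> {classify(pct)}")
```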

Table 1 Diagnostics indicating the importance of post-CCSD(T) contributions for the amino acids

Table 2 gives an overview of basis set convergence of the CCSD-F12 component of the total atomization energy. The magnitude of the valence CCSD-F12 correlation component spans a relatively large range. For example, the CCSD-F12/V{D,T}Z-F12 results extrapolated with \(\alpha\) = 3.67 (which was optimized to minimize the RMSD over 137 first- and second-row systems in the W4-11 dataset [31]) extend from 272.48 (glycine) up to 701.76 (arginine) kcal/mol. The differences between the CCSD-F12/V{D,T}Z-F12 results obtained with \(\alpha\) = 3.67 (optimized over the entire W4-11 set of small molecules) and \(\alpha\) = 3.38 (optimized over the subset of 97 first-row species only) can get quite significant for these medium-sized species, ranging from 0.25 kcal/mol for glycine to 0.71 kcal/mol for arginine. Note that these differences still only correspond to about 0.1 % of the valence CCSD correlation component. For comparison, for the systems in the W4-11 dataset, the absolute differences between the CCSD-F12/V{D,T}Z-F12 component extrapolated with \(\alpha\) =  3.67 and 3.38 are reduced to just 0.00–0.22 kcal/mol, or 0.08 kcal/mol mean absolute—likewise, about 0.1 % of the valence CCSD correlation component of the atomization energy. Finally, using instead the extrapolation exponent optimized by Hill et al. [43] (\(\alpha\) = 3.144), which was optimized over a smaller set of 14 absolute correlation energies, results in atomization energies increased by 0.24 (glycine) up to 0.69 (arginine) kcal/mol over the values with \(\alpha\) = 3.38 (Table 2).

Table 2 Overview of the basis set convergence of the CCSD-F12 component of the total atomization energies for the amino acids (kcal/mol)

For five smaller amino acids (alanine, cysteine, glycine, methionine, and serine), we were able to obtain CCSD-F12/VQZ-F12 energies. Table 2 gives the CCSD-F12/V{T,Q}Z-F12 results extrapolated with \(\alpha\) = 5.94 (used in W2-F12 theory [31]) and 4.596 (from Ref. [43]). For these systems, the difference between the CCSD-F12/V{T,Q}Z-F12 contributions extrapolated with \(\alpha\) = 5.94 and 4.596 ranges between 0.20 (glycine) and 0.34 (methionine) kcal/mol (Table 2). We note that the error statistics over the 137 systems in the W4-11 dataset are as follows: RMSD = 0.13, MAD = 0.10, and MSD = 0.01 kcal/mol for \(\alpha\) = 5.94, and RMSD = 0.15, MAD = 0.11, and MSD = 0.08 kcal/mol for \(\alpha\) = 4.596. Peterson and Feller [44] obtained benchmarks extrapolated from basis sets as large as aug-cc-pV8Z for a fairly large sample of molecules that overlaps W4-11 and found that CCSD-F12b/V{T,Q}Z-F12 tends to overestimate the valence CCSD component on average: as they were using \(\alpha\) = 4.596, this is consistent with the present finding. (They also report difficulties reaching 0.1 kcal/mol convergence for CCSD-F12b energies with aug-cc-pV5Z basis sets: we were only able to apply this basis set to glycine, and in any case 0.1 kcal/mol is smaller than other potential error sources in the present work.)

Table 3 Component breakdown of the W1-F12 and W2-F12 atomization energies and final gas-phase heats of formation at 0 and 298 K for the lowest-energy conformers of the amino acids (kcal/mol)

For the five W2-F12 amino acids, the RMSDs for CCSD-F12/V{D,T}Z-F12 with various choices of extrapolation exponent are 0.43 (\(\alpha\) = 3.67), 0.14 (\(\alpha\) = 3.38), and 0.24 (\(\alpha\) = 3.144) kcal/mol. Taking the average between the CCSD-F12/V{D,T}Z-F12 components extrapolated with \(\alpha\) = 3.38 and 3.144 results in an RMSD of 0.12 kcal/mol and a mean signed deviation of only +0.06 kcal/mol. We thus use this averaged CCSD-F12/V{D,T}Z-F12 component in our final W1-F12 atomization energies. The spread between the \(\alpha\) = 3.38 and 3.144 values can be considered a crude gauge of the uncertainty in the basis set limit.

Table 4 Dependence of computed ZPVEs (kcal/mol) on the level of theory

The component breakdowns of the W1-F12 and W2-F12 atomization energies are gathered in Table 3. The following general observations may be noted:

  • As pointed out above, the magnitude of the valence CCSD-F12 correlation component runs a large gamut, extending from 272.85 (glycine) up to 702.80 (arginine) kcal/mol.

  • The magnitude of the valence (T) correlation component can be rather large, reaching 54.28 kcal/mol for phenylalanine.

  • The core–valence contribution approaches or exceeds 10 kcal/mol for the largest systems. Namely, it is 9.88 (arginine) and 11.68 (phenylalanine) kcal/mol.

  • The DBOC contribution ranges from 0.28 (glycine) up to as much as 0.72 (arginine) kcal/mol.

Comparison of the W1-F12 and W2-F12 results for alanine, cysteine, glycine, methionine, and serine reveals the following:

  • The HF/V{D,T}Z-F12 component systematically underestimates the HF/VQZ-F12 basis set limit, namely by 0.03 (glycine), 0.04 (alanine and cysteine), 0.05 (serine), and 0.08 (methionine) kcal/mol.

  • Our best CCSD-F12/V{D,T}Z-F12 component overestimates the CCSD-F12/V{T,Q}Z-F12 component by 0.05 (glycine), 0.06 (methionine), 0.10 (serine), 0.20 (alanine) kcal/mol, and underestimates it by 0.11 kcal/mol for cysteine.

  • The valence (T) contribution from W1-F12 theory systematically overestimates the W2-F12 results, specifically by 0.06 (cysteine), 0.13 (methionine), 0.17 (glycine), 0.20 (alanine), and 0.25 (serine) kcal/mol.

  • The core–valence contribution from W1-F12 systematically underestimates the W2-F12 result, namely by 0.09 (glycine), 0.12 (alanine), 0.14 (serine), and 0.16 (cysteine) kcal/mol (we were not able to obtain the core–valence contribution for methionine from W2-F12 theory).

  • Overall, the \(\hbox {TAE}_e\) from W1-F12 theory overestimates the \(\hbox {TAE}_e\) from W2-F12 theory by 0.11 (glycine and methionine), 0.16 (serine), and 0.23 (alanine) kcal/mol, and underestimates it by 0.30 kcal/mol for cysteine.
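The valence CCSD-F12 deviations listed in the comparison above can be condensed into the usual error statistics. The short sketch below does this for the five signed deviations (W1-F12 best estimate minus W2-F12); the helper name is hypothetical. The result matches the mean signed deviation of +0.06 kcal/mol and the RMSD of 0.12 kcal/mol quoted earlier for the averaged extrapolation.

```python
import math

# Signed deviations (kcal/mol) of the best CCSD-F12/V{D,T}Z-F12 component from
# the CCSD-F12/V{T,Q}Z-F12 component, as listed in the comparison above.
ccsd_dev = {
    "glycine": 0.05,
    "methionine": 0.06,
    "serine": 0.10,
    "alanine": 0.20,
    "cysteine": -0.11,
}


def error_stats(deviations):
    """Return (MSD, MAD, RMSD) for an iterable of signed deviations."""
    values = list(deviations)
    n = len(values)
    msd = sum(values) / n
    mad = sum(abs(v) for v in values) / n
    rmsd = math.sqrt(sum(v * v for v in values) / n)
    return msd, mad, rmsd


msd, mad, rmsd = error_stats(ccsd_dev.values())
print(f"MSD = {msd:+.2f}, MAD = {mad:.2f}, RMSD = {rmsd:.2f} kcal/mol")
```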

As noted in the “Computational details” section, we were able to “cross-check” the result for glycine at the W4 level: the lower-cost W2.2 level is obtained as a by-product. As seen in Table 3, the SCF, CCSD, (T), core-valence, and relativistic components of the W2-F12 calculation are all in excellent agreement with the W4 calculation, the cumulative difference being just 0.04 kcal/mol. The higher-order correlation steps, CCSDT(Q)/cc-pVTZ and CCSDTQ/cc-pVDZ, are more problematic from a computational point of view, but their importance is typically quite small for molecules dominated by a single reference configuration (due to error compensation between “antibonding” higher-order \(T_3\) and “bonding” \(T_4\) contributions [68–73]). Absent a direct calculation, their importance can be estimated by assuming that their contribution to the following isodesmic reaction energy will be approximately zero:

$$\begin{aligned} \hbox {CH}_{3}\hbox {COOH} + \hbox {CH}_{3}\hbox {NH}_{2} \rightarrow \hbox {glycine} + \hbox {CH}_{4} \end{aligned}$$
(1)

From Table SI-II of Ref. [32], we find the post-CCSD(T) contributions to the TAEs to be \(-\)0.05 kcal/mol for acetic acid, \(-\)0.09 kcal/mol for methyl amine, and +0.01 kcal/mol for methane, leading to an estimated post-CCSD(T) correction of \(-\)0.15 kcal/mol for glycine.
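The arithmetic behind this estimate follows directly from Eq. (1): if the post-CCSD(T) contribution to the reaction energy is zero, the contribution for glycine equals the sum over the reactants minus that of methane. A minimal sketch, with a hypothetical variable layout and the values quoted above:

```python
# Post-CCSD(T) contributions to the TAEs (kcal/mol), as quoted in the text.
post_ccsdt = {"acetic acid": -0.05, "methylamine": -0.09, "methane": 0.01}

# Zero net contribution to CH3COOH + CH3NH2 -> glycine + CH4 implies
# glycine = acetic acid + methylamine - methane.
glycine_estimate = (
    post_ccsdt["acetic acid"] + post_ccsdt["methylamine"] - post_ccsdt["methane"]
)
print(f"estimated post-CCSD(T) correction for glycine: {glycine_estimate:+.2f} kcal/mol")
```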

3.3 A note on zero-point vibrational energies (ZPVEs)

In view of the magnitude of the zero-point vibrational energies (50–140 kcal/mol, see Table 4), some remarks are due concerning their calculation. Ideally, one should obtain them from accurate anharmonic force fields, and for small molecules, this is indeed a practical option [68, 85, 91]. In the present case, however, the computational cost would be prohibitive with the computational resources at hand, and multiplication of calculated harmonic frequencies with a scaling factor \(\lambda (\mathrm{ZPVE})\) appropriate for zero-point vibrational energies [50, 83, 84, 86, 90] is the only practical option. As shown in Ref. [84], ZPVEs are typically almost exact averages of one-half the sum of the harmonics and one-half the sum of the fundamentals, the difference being just \(\mathrm{ZPVE}-(1/4)\sum _i(\omega _i+\nu _i)=G_0-\sum _i X_{ii}/4\), where the \(X_{ii}\) are the diagonal anharmonicity constants and \(G_0\) is the polyatomic counterpart of the small \(Y_{00}\) Dunham constant [82] in diatomics. Consequently [50, 84, 90], the optimal scaling factor for ZPVEs is almost exactly midway between a \(\lambda ({\omega })\) suitable for harmonic frequencies (as an approximate correction for systematic bias in the calculated frequencies) and a \(\lambda ({\nu })\) suitable for fundamental frequencies (which additionally seeks to approximately correct for anharmonicity). In fact, Alecu et al. [86] found for a large variety of basis sets and ab initio and DFT methods that \(\lambda ({\omega })/\lambda (\mathrm{ZPVE})=1.014\pm 0.002\), which is almost exactly the ratio of 1.0143 found by Perdew and coworkers [87] between harmonic frequencies and ZPVEs derived from experimental anharmonic force fields. Note that the “small” uncertainty of 0.002 on a ZPVE of 140 kcal/mol still would translate to about 0.3 kcal/mol, and even that is probably optimistic for the uncertainty in an individual molecule [88]. It has been argued earlier [91] (see also Ref. [92]) that for organic and bio-organic molecules that are “well-behaved” from an electronic structure point of view, the main factor limiting accuracy in computational thermochemistry may well be the treatment of the nuclear motion, rather than the electronic problem as such.

Computed zero-point vibrational energies for the amino acids at various levels of theory (including those used in the composite thermochemistry schemes compared in this work) are listed in Table 4. In search of an alternative that was more accurate than B3LYP yet still comparatively affordable, we considered the B2PLYP double hybrid functional [93] in conjunction with the def2-TZVPP basis set [94] and optimized a \(\lambda (\omega )\) scaling factor by minimizing the RMSD for the HFREQ27 dataset [95] of accurately known harmonic frequencies. As can be seen in Table 4, the RMSD over the HFREQ27 set is only half that of B3LYP and drops to 13.2  cm\(^{-1}\) if the anomalous F\(_2\) molecule is eliminated. (For comparison, the HFREQ27 RMSD for CCSD(T)/cc-pV(Q+d)Z is still 8.4  cm\(^{-1}\).) The optimum scale factor \(\lambda (\omega )=0.9971\) is very close to unity, and in conjunction with the “universal” ratio of 1.014 translated into \(\lambda (\mathrm{ZPVE})=0.9833\). As a sanity check on our procedure, we re-evaluated the \(\lambda (\mathrm{ZPVE})\) for B3LYP/6-31G(2df,p) and B3LYP/6-311G(2d,d,p) and obtained 0.9858 and 0.9896, respectively, which agree to better than 3 decimal places with the “official” values used in G4 theory and CBS-QB3, respectively [63, 67].
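The conversion from a harmonic-frequency scaling factor to a ZPVE scaling factor via the “universal” ratio is a single division. The sketch below reproduces the value of 0.9833 and the simple estimate, used earlier in this section, of how a spread of 0.002 maps onto a 140 kcal/mol ZPVE; the function and constant names are hypothetical.

```python
UNIVERSAL_RATIO = 1.014  # lambda(omega) / lambda(ZPVE), as quoted in the text


def zpve_scale_factor(lambda_omega, ratio=UNIVERSAL_RATIO):
    """Convert a harmonic-frequency scaling factor into a ZPVE scaling factor."""
    return lambda_omega / ratio


lam_zpve = zpve_scale_factor(0.9971)
print(f"lambda(ZPVE) = {lam_zpve:.4f}")

# Simple estimate following the text: a 0.002 spread applied to a ZPVE of
# 140 kcal/mol corresponds to roughly 0.3 kcal/mol.
zpve_uncertainty = 0.002 * 140.0
print(f"approximate ZPVE uncertainty: {zpve_uncertainty:.2f} kcal/mol")
```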

It can be seen in Table 4 that the lower levels of theory used for ZPVEs in G3(MP2) [61] and G3(MP2)B3 [62] can yield values several kcal/mol lower than the highest-level method: the RMSDs from B2PLYP/def2-TZVPP are 2.12 and 2.29 kcal/mol, respectively, compared to 0.33 and 0.14 kcal/mol, respectively, for B3LYP/6-31G(2df,p) (scaled by 0.9854) as used by the G4 variants, and B3LYP/6-311G(2d,d,p) as used by CBS-QB3. (The “2d” refers to the use of an extra d function on second-row elements.) However, B3LYP/cc-pV(T+d)Z scaled by 0.985, as used in W1 and W1-F12 theory, also appears to yield values that are too low, and indeed \(\lambda (\mathrm{ZPVE})\) as obtained from the HFREQ27 set is 0.9892. For B3LYP with a basis set that is effectively at the Kohn–Sham limit, \(\lambda (\omega )\) = 1.004 was found, which corresponds to \(\lambda (\mathrm{ZPVE})\) = 0.99, and the database of Radom and coworkers [90] likewise lists scaling factors near 0.99 for B3LYP with large basis sets. While a scaling factor of 0.985 vs. 0.990 may rightly be considered a distinction without a difference for small molecules (where anybody concerned about 0.1 kcal/mol in a ZPVE should seriously consider an accurate anharmonic ZPVE), the problem is much more obvious in larger systems such as presently considered.

For one system, glycine, an anharmonic value of 49.438 kcal/mol is available due to Puzzarini and coworkers [81], who combined CCSD(T)/CBS harmonic frequencies with a DFT anharmonic force field. Fortuitously, our scaled B2PLYP/def2-TZVPP value agrees to two decimal places. As an additional observation, for ethane, the accurate anharmonic ZPVE is 46.29 kcal/mol, [91] compared to 45.97 kcal/mol B3LYP/cc-pVTZ scaled by 0.985, 46.20 with a revised scaling factor of 0.99, and 46.33 kcal/mol at the B2PLYP/def2-TZVPP level scaled by 0.9833.
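Because a scaled ZPVE is linear in the scaling factor, changing the factor simply rescales the result. The sketch below checks the ethane numbers quoted above under that assumption; the function name is hypothetical.

```python
def rescale_zpve(zpve_scaled, old_factor, new_factor):
    """Convert a ZPVE scaled by old_factor to the value scaled by new_factor."""
    return zpve_scaled * new_factor / old_factor


# Ethane, B3LYP/cc-pVTZ: 45.97 kcal/mol with 0.985, re-expressed with 0.99.
ethane_revised = rescale_zpve(45.97, old_factor=0.985, new_factor=0.99)
print(f"ethane ZPVE with revised factor: {ethane_revised:.2f} kcal/mol")
print(f"deviation from anharmonic value: {ethane_revised - 46.29:+.2f} kcal/mol")
```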

3.4 Performance of G\(n\) methods for the heats of formation of the amino acids

In this section we use our best heats of formation from W1-F12 and W2-F12 theories (given in Table 3) to evaluate the performance of a variety of composite thermochemical Gaussian-\(n\) (G\(n\)) procedures including G3(MP2), [61] G3(MP2)B3, [62] G4, [63] G4(MP2), [64] and G4(MP2)-6X [65]. Table 5 presents the deviations (G\(n\)–W\(n\)-F12) from our benchmark W\(n\)-F12 results, as well as the RMSD, mean absolute deviations (MAD), and mean signed deviations (MSD) for the G\(n\) methods. Stover et al. [17] obtained G3(MP2) heats of formation for the amino acids: except for phenylalanine, cysteine, and methionine, the deviations between their heats of formation and our reference values exceed 1 kcal/mol. The mean signed deviation (MSD) of 1.90 kcal/mol being nearly equal to the RMSD of 2.25 kcal/mol indicates a very systematic error. Simply switching to G3(MP2)B3 cuts the MSD to 0.78 kcal/mol and the RMSD to 1.13 kcal/mol, while “upgrading” to G3B3 lowers these numbers even further to 0.45 and 0.60 kcal/mol, respectively. While both methods use B3LYP rather than MP2 reference geometries, the entire G3 family suffers from underestimated ZPVEs for the amino acids (Table 4), so apparently some of that issue is absorbed by the empirical correction. Stover et al. [17] also obtained G3(MP2) heats of formation via isodesmic bond separation reactions. As expected, this improves the performance, with RMSD = 1.48 kcal/mol and a maximum deviation of 2.40 kcal/mol for phenylalanine. We note, however, that their CCSD(T)/CBS anchor value for the heat of formation at room temperature of glycine, \(-\)92.6 kcal/mol, is 1.5 kcal/mol higher (less exothermic) than our W2-F12 value. If we substitute the latter in their isodesmic reactions, their RMSD plunges to just 0.47 kcal/mol.

Table 5 Performance of a selection of composite procedures of the G\(n\) family for the calculation of heats of formation (\(\Delta _f H^\circ _{298 K}\), exclusive of conformer correction) of the 18 amino acids in Table 3

We now turn our attention to the performance of the Gaussian-4 family: G4, [63] G4(MP2), [64] and G4(MP2)-6X [65]. The G4(MP2) procedure exhibits somewhat disappointing performance, its RMSD = 1.80 kcal/mol placing intermediately between G3(MP2) and G3(MP2)B3. The largest deviations are obtained for asparagine (2.48), lysine (2.32), glutamine (3.15), and arginine (3.34 kcal/mol), but all other deviations exceed 1 kcal/mol apart from phenylalanine, cysteine, and methionine. The computationally more expensive “full” G4 procedure yields much better performance with an RMSD of 0.72 kcal/mol, and just three cases exceeding 1 kcal/mol (glutamine 1.39, arginine 1.21, and lysine 1.37 kcal/mol). However, an essentially identical RMSD = 0.71 kcal/mol is afforded by the G4(MP2)-6X procedure, which involves the same computational steps and cost as G4(MP2) but entails six additional empirical scaling factors. Deviations larger than 1 kcal/mol are obtained for just four systems, namely arginine (1.63), glutamine (1.63), asparagine (1.10), and methionine (\(-\)1.02 kcal/mol). Finally, we note that the CBS-QB3 method clocks in at RMSD = 1.01 kcal/mol.

Very recently, Ramabhadran et al. [21] determined the enthalpies of formation of cysteine and methionine using their connectivity-based hierarchy (CBH-\(n\)) approach [77, 78]. From their Table 3, the best enthalpies of formation obtained for the lowest-energy conformer at the CBH-2 (isoatomic) rung using experimental heats of formation for the reference species and CCSD(T)/6-311++G(3df,2p) reaction energies are \(-\)96.1 (cysteine) and \(-\)104.3 (methionine) kcal/mol. From their Table 7, we calculate conformer corrections of +0.77 kcal/mol for cysteine and +0.37 kcal/mol for methionine: The latter we actually use in the present work, while the former is slightly less than our own calculation of 0.81 kcal/mol. According to their Table 9, the heats of formation after conformer correction are \(-\)95.3 and \(-\)104.0 kcal/mol (the latter value presumably after roundoff), both more exothermic than our W2-F12 values (Table 3) of \(-\)94.5 and \(-\)102.4 kcal/mol. We do note that some of the experimental data for reference species used in Ref. [21] carry non-trivial uncertainties, which could account for at least some of the discrepancy.

3.5 Comparison with experiment

Comparison with experiment obviously entails thermal corrections. The RRHO approximation will cause some errors, the largest of which will be neglect of the population of the various low-energy conformers. If we neglect the difference between the rovibrational partition functions of the different conformers, then the conformer contribution to the enthalpy function \(\mathrm{hcf}_{298}\equiv H_{T=298}-E_0\) is easily found as [96]

$$\begin{aligned} \mathrm{hcf}_{298}^\mathrm{conf}=RT\,\frac{\sum _i x_i \exp ( -x_i)}{\sum _i \exp ( -x_i)} \quad \hbox {where}\quad x_i \equiv \frac{E_i - E_0}{RT} \end{aligned}$$
(2)

where the index \(i\) runs over the conformers. The effect of accounting for different rovibrational partition functions in the different conformers was considered in Ref. [96] for the alkane conformers and is negligible compared to other potential error sources in the present calculation, such as the neglect of anharmonicity and the uncertainty in the basis set extrapolation. Conformer energies were gathered from published calculations [21–24, 81, 100–112]; these range from complete basis set CCSD(T) studies for glycine [81] and alanine [24] to relatively low-level MP2 or DFT calculations for some other species. Details are given in the footnotes to Table 3.
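
The Boltzmann average above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to generate the tabulated corrections: the function name `conformer_correction` and the sample conformer energies are hypothetical, and the dimensionless average is multiplied by \(RT\) to express the result in kcal/mol.

```python
import math

R_KCAL = 1.98720425864083e-3  # gas constant in kcal/(mol K)

def conformer_correction(rel_energies, temperature=298.15):
    """Boltzmann-averaged conformer contribution to the enthalpy (kcal/mol).

    rel_energies: conformer energies relative to the lowest conformer,
    in kcal/mol (the lowest conformer itself enters as 0.0).
    Differences in rovibrational partition functions are neglected.
    """
    rt = R_KCAL * temperature
    x = [e / rt for e in rel_energies]        # x_i = (E_i - E_0) / RT
    weights = [math.exp(-xi) for xi in x]     # Boltzmann factors
    return rt * sum(xi * wi for xi, wi in zip(x, weights)) / sum(weights)

# HYPOTHETICAL relative conformer energies (kcal/mol), for illustration only.
example = [0.0, 0.5, 1.0, 2.0]
print(f"conformer correction: {conformer_correction(example):.2f} kcal/mol")
```

A single conformer gives a correction of exactly zero, and the correction grows as more conformers lie within a few \(RT\) (about 0.6 kcal/mol at room temperature) of the global minimum.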

Table 6 lists the available experimental gas-phase heats of formation at 298 K (\(\Delta H_{f,298}^{\circ }\)). Our W2-F12 value for alanine (\(-\)101.5 kcal/mol) is spot on the experimental value of Dorofeeva and Ryzhova [97] (\(-\)101.5\(\,\pm\,\)0.5 kcal/mol) and still agrees to within mutual uncertainties with that of da Silva et al. [15] (\(-\)101.9\(\,\pm\,\)0.7 kcal/mol). However, the NIST Chemistry WebBook [79] value (\(-\)99.1\(\,\pm\,\)1.0 kcal/mol) is clearly incompatible with our calculations.

Table 6 Experimental gas-phase heats of formation at 298 K for the amino acids (kcal/mol)

Our W2-F12 heat of formation for cysteine (\(-\)94.5 kcal/mol) suggests that the experimental value of Roux et al. [19] should be revised downward by about 2.8 kcal/mol; the recent study of Ramabhadran et al. [21] suggests even further downward revision (vide supra). As for glycine, the calculated heats of formation (\(-\)94.1 kcal/mol from W2-F12, \(-\)94.0 from quasi-W4) and the available experimental values agree to within overlapping uncertainties. Specifically, our calculations are spot on the experimental value of Dorofeeva and Ryzhova [97] (\(-\)94.1\(\,\pm\,\)0.4 kcal/mol), just slightly below the experimental value from the CRC Handbook (\(-\)93.7 kcal/mol), and toward the more negative end of the uncertainty band of the NIST Chemistry WebBook value (\(-\)93.3\(\,\pm\,\)1.1 kcal/mol). Our W2-F12 value for methionine (\(-\)102.4 kcal/mol) agrees well with the new measurement of Roux et al. [18] (\(-\)102.8\(\,\pm\,\)2.4 kcal/mol), and both imply a downward revision of the NIST Chemistry WebBook value (\(-\)98.8\(\,\pm\,\)1.0 kcal/mol) by about 3–4 kcal/mol. As for phenylalanine, our W1-F12 value (\(-\)76.9 kcal/mol) suggests that the experimental value from the CRC Handbook (\(-\)74.8 kcal/mol) should be revised downward by about 2 kcal/mol. The W1-F12 values for proline (\(-\)92.8 kcal/mol) and valine (\(-\)113.6 kcal/mol) suggest that the experimental values should be revised downward by about 5 kcal/mol (Table 6).
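
The suggested revisions follow from simple differences between the experimental and calculated values. The short sketch below reproduces that arithmetic for the cases where both numbers are quoted in this section; the dictionary layout and variable names are ours, and the results are differences of values already rounded to 0.1 kcal/mol.

```python
# (calculated, experimental) heats of formation at 298 K in kcal/mol,
# as quoted in this section.
pairs = {
    "alanine vs NIST WebBook": (-101.5, -99.1),
    "glycine vs CRC Handbook": (-94.1, -93.7),
    "glycine vs NIST WebBook": (-94.1, -93.3),
    "methionine vs NIST WebBook": (-102.4, -98.8),
    "phenylalanine vs CRC Handbook": (-76.9, -74.8),
}

# A positive number means the experimental value would have to be
# revised downward (made more negative) by that amount.
revisions = {
    label: round(experiment - theory, 1)
    for label, (theory, experiment) in pairs.items()
}

for label, delta in revisions.items():
    print(f"{label}: {delta:.1f} kcal/mol")
```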

For the two largest amino acids, tryptophan and tyrosine, we were unable to calculate W1-F12 atomization energies. At the G4, CBS-QB3, and G4(MP2)-6X levels, respectively, we obtain heats of formation at 0 K for tryptophan of \(-\)49.60, \(-\)47.87, and \(-\)48.77 kcal/mol, and for tyrosine of \(-\)109.12, \(-\)108.58, and \(-\)108.49 kcal/mol. At room temperature, the corresponding values are \(-\)59.98, \(-\)58.27, and \(-\)58.98 kcal/mol for tryptophan and \(-\)118.56, \(-\)118.03, and \(-\)117.78 kcal/mol for tyrosine. Averaging over all three levels of theory and adding conformer corrections of 0.71 kcal/mol for tryptophan [111] and 0.65 kcal/mol for tyrosine, we finally obtain estimated heats of formation at 298 K of \(-\)58.37 kcal/mol for tryptophan and \(-\)117.47 kcal/mol for tyrosine.
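
The averaging step can be checked with a few lines of Python. The sketch takes the 298 K values and conformer corrections quoted above and applies an unweighted mean over the three levels of theory; the variable names are ours.

```python
# Heats of formation at 298 K (kcal/mol) at the G4, CBS-QB3, and
# G4(MP2)-6X levels, as quoted in the text.
h298 = {
    "tryptophan": [-59.98, -58.27, -58.98],
    "tyrosine": [-118.56, -118.03, -117.78],
}

# Conformer corrections (kcal/mol), as quoted in the text.
conformer_corr = {"tryptophan": 0.71, "tyrosine": 0.65}

# Unweighted average over the three methods plus the conformer correction.
best_estimate = {
    name: sum(values) / len(values) + conformer_corr[name]
    for name, values in h298.items()
}

for name, value in best_estimate.items():
    print(f"{name}: {value:.2f} kcal/mol")
```

The spread among the three methods (about 1.7 kcal/mol for tryptophan and 0.8 kcal/mol for tyrosine) gives a rough sense of how much these estimates depend on the choice of composite procedure.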

4 Conclusions

We have obtained benchmark heats of formation at the CCSD(T)/CBS limit for 18 of the 20 natural amino acids. Our best heats of formation at 298 K (\(\Delta H_{f,298}^{\circ }\)) are \(-\)101.5 (alanine), \(-\)98.8 (arginine), \(-\)146.5 (asparagine), \(-\)189.6 (aspartic acid), \(-\)94.5 (cysteine), \(-\)151.0 (glutamine), \(-\)195.5 (glutamic acid), \(-\)94.0 (glycine, quasi-W4) or \(-\)94.1 (glycine, W2-F12), \(-\)69.8 (histidine), \(-\)118.3 (isoleucine), \(-\)118.8 (leucine), \(-\)110.0 (lysine), \(-\)102.4 (methionine), \(-\)76.9 (phenylalanine), \(-\)92.8 (proline), \(-\)139.2 (serine), \(-\)149.0 (threonine), and \(-\)113.6 (valine) kcal/mol. These heats of formation are obtained at the W2-F12 level for alanine, cysteine, glycine, methionine, and serine, and at the W1-F12 level for the rest. For the two largest amino acids, tryptophan and tyrosine, an average over G4, G4(MP2)-6X, and CBS-QB3 yields best estimates of \(-\)58.4 and \(-\)117.5 kcal/mol, respectively.

Uncertainties caused by issues with the zero-point vibrational energy and the conformer corrections rival, and probably exceed, those directly related to the electronic structure treatment. The overall uncertainty is somewhat difficult to quantify, but a semi-quantitative estimate would range from about \(\pm\)0.5 kcal/mol for the smaller, to about \(\pm\)1 kcal/mol for the larger, amino acids.

For glycine, by way of validation, we were able to obtain a “quasi-W4” result corresponding to \(\hbox {TAE}_e=968.1\), \(\hbox {TAE}_0=918.6\), \(\Delta H_{f,0}^{\circ }\) = \(-\)90.0, and \(\Delta H_{f,298}^{\circ }\) = \(-\)94.0 kcal/mol.

Our best theoretical values suggest that the experimental gas-phase heats of formation from the NIST WebBook should be revised downward by 2.4 (alanine), 0.7–0.8 (glycine), 3.6 (methionine), and 5.3 (proline) kcal/mol. Similarly, we suggest that the experimental values from the CRC Handbook should be revised downward by 0.4 (glycine), 2.0 (phenylalanine), and 4.8 (valine) kcal/mol. Our best theoretical values are in good agreement with the recently reported experimental values of Roux and coworkers for alanine [15] and methionine [18], but suggest that their experimental value for cysteine should be revised downward by 2.8 kcal/mol. Finally, our best theoretical values for alanine and glycine are in excellent agreement with the recent values of Dorofeeva and Ryzhova [97].

Using our W1-F12 and W2-F12 benchmark heats of formation, we assess the performance of the empirical composite G\(n\) procedures. We obtain the following RMSDs: 2.25 (G3(MP2)), 1.13 (G3(MP2)B3), 0.60 (G3B3), 1.80 (G4(MP2)), 0.71 (G4(MP2)-6X), and 0.72 (G4) kcal/mol. In particular, G4(MP2)-6X appears to offer an excellent performance-to-computational-cost ratio.

Finally, it appears that for W1 and W1-F12, the scaling factor for the B3LYP/cc-pV(T+d)Z or B3LYP/aug’-cc-pV(T+d)Z zero-point vibrational energy should be revised upward to 0.990.

5 Supporting information

B3LYP/A’VTZ optimized geometries for the species considered in the present work (Table S1). Full references for ref [40] (Gaussian 09) and ref [41] (Molpro 2010) (Table S2). B2PLYP/def2-TZVPP harmonic frequencies for all amino acids except tryptophan and tyrosine, and B3LYP/aug’-cc-pV(T+d)Z frequencies for all amino acids.