Introduction

Precise knowledge of intermolecular interactions between proteins and their ligands is of critical importance in rational drug or biocatalyst design. Experimental studies still yield very limited information in this respect, while rigorous ab initio calculations are prohibitively expensive owing to their steep scaling with basis set size. Therefore, simulations of large biomolecular systems are usually based on empirical force fields (FF), which exhibit limited transferability between molecular systems, mainly due to the electrostatic term [1–3], and do not preserve any clear physical meaning for their individual non-bonded contributions. Historically, the first conventional non-bonded force field parameters were fitted to reproduce experimentally available data such as enthalpies of vaporization or sublimation; however, the resulting parameter sets usually differ considerably from one another. Accurate quantum chemical interaction energies, which can cover a much wider range of intermolecular distances, constitute an alternative and more reliable source of data for FF fitting. Force fields derived in this way could therefore be called non-empirical. To our knowledge, only a few attempts have been made to derive universal [1, 4] FF nonbonded energy terms directly by fitting interaction energy components as defined by perturbation theory. Corresponding system-specific potential functions have been much more frequent [5–8]. In this contribution, we focus on improving the functional form for the exchange repulsion term, which has been noted as a major source of error in conventional force fields [9]. We have already encountered this issue when studying inhibitors docked into urokinase, where a force field placed some inhibitors 0.5 Å deeper in the active site than ab initio results [10]. One of the reasons for this could be an inappropriate functional form for the repulsion term, usually represented by \(\alpha R^{-12}\) or \(\alpha \exp(-\beta R)\), whereas the analytical formula for the first-order exchange term involves squared overlap integrals, composed in turn of products of complex polynomials and exponential functions [11, 12].

With this in mind, we systematically explore various possible analytical formulas for reproducing the first-order exchange component for a large number of biomolecular complexes and determine their optimal parameters. To facilitate applications in conventional software, we limit ourselves to products of polynomial and exponential functions resembling those mentioned above, since they can be easily incorporated into typical force fields.

This work is an extension of our earlier report [1], in which what were perhaps some of the first universal non-empirical atom–atom potentials were derived for the main interaction energy components. These were calculated in a minimal valence basis set for 336 hydrogen-bonded dimers and tested on the packing of \(\mathrm{N}_{2}\), \(\mathrm{CO}_{2}\), and nitromethane crystals [13]. The present contribution is based on an extended aug-cc-pVDZ basis set and a much wider collection of 660 biomolecular complexes, including various types of interactions besides hydrogen bonds. We tested the resulting nonempirical atom–atom potentials on the urokinase–inhibitor complexes mentioned above, which exhibited artefact structures when optimized with the standard Tripos 5.2 force field [10].

Interaction energy components

The only rational way to determine non-empirical atom–atom potentials (NEAAP) is to partition the intermolecular interaction energy ΔE into well-defined components, separating system-specific terms (for example, first-order electrostatic) from other, more transferable contributions whose different distance dependences require different analytical representations. There are many possible interaction energy partitioning methods, mostly related to the variational Morokuma scheme [14] or symmetry-adapted perturbation theory (SAPT) [15]. Non-bonded interactions can also be explained in an elegant way using the Hellmann–Feynman theorem [16]; unfortunately, this theorem has not yet been shown in practice to provide simple potential functions. In this study, we have applied the hybrid variation-perturbation theory (HVPT) [17, 18], in which the total interaction energy for systems containing over 1800 AOs [19] can be partitioned into the following contributions:

$${\Delta} E =E_{EL,MTP}^{(10)} + E_{EL,PEN}^{(10)} + E_{EX}^{(10)} + E_{DEL}^{(R0)} + E_{CORR}, $$

The first-order multipole electrostatic term \(E_{EL,MTP}^{(10)}\) is obtained using SCF monomer cumulative atomic multipole moments (CAMMs) \( M_{A}^{(k_{a})} \) and \( M_{B}^{(k_{b})} \) for all atom pairs of A and B molecules [20]:

$$E_{EL,MTP}^{(10)} = {\sum\limits_{a}^{A}} {\sum\limits_{b}^{B}} \sum\limits_{k_{a}} \sum\limits_{k_{b}} M_{A}^{(k_{a})}[k_{a}]T^{(k_{a}+k_{b})}[k_{b}]M_{B}^{(k_{b})}, \qquad k_{a} + k_{b} \leq L $$

where \(M^{(k)}\) is a rank-\(k\) multipole and \(T^{(k_{a}+k_{b})}\) is the Cartesian interaction tensor containing all the partial derivatives of \(|R_{ab}|^{-1}\) of rank \(k_{a}+k_{b}\). In this contribution, the CAMM expansion was truncated at the \(R^{-5}\) term (rank 5), yielding the best convergence [21], and was calculated using the GAMESS-US ab initio package (activated by adding $ELMOM IAMM=n $END to the input file, where n is the desired highest rank) [22]. Higher atomic multipole moments can easily be transformed into alternative point charge models [23, 24].
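The structure of the multipole sum can be illustrated with a minimal Python sketch. It is not the CAMM implementation used in this work: it keeps only atomic charges and dipoles (far below the truncation rank quoted above), uses atomic units, and adopts the common sign convention in which the tensors are derivatives of \(1/R\) with \(R = r_{b} - r_{a}\). All function names and the data layout (a list of `(charge, dipole, position)` tuples per molecule) are assumptions made for illustration.

```python
import math

def interaction_tensors(r_a, r_b):
    """Cartesian tensors T, T_alpha, T_alpha_beta: derivatives of 1/|R|, R = r_b - r_a."""
    R = [b - a for a, b in zip(r_a, r_b)]
    d = math.sqrt(sum(x * x for x in R))
    t0 = 1.0 / d
    t1 = [-x / d**3 for x in R]
    t2 = [[(3.0 * R[i] * R[j] - (d * d if i == j else 0.0)) / d**5
           for j in range(3)] for i in range(3)]
    return t0, t1, t2

def pair_energy(q_a, mu_a, r_a, q_b, mu_b, r_b):
    """Charge-charge, charge-dipole and dipole-dipole energy of one atom pair (a.u.)."""
    t0, t1, t2 = interaction_tensors(r_a, r_b)
    e_qq = q_a * t0 * q_b
    e_qmu = sum(t1[i] * (q_a * mu_b[i] - mu_a[i] * q_b) for i in range(3))
    e_mumu = -sum(mu_a[i] * t2[i][j] * mu_b[j]
                  for i in range(3) for j in range(3))
    return e_qq + e_qmu + e_mumu

def multipole_energy(mol_a, mol_b):
    """Double sum over all atoms a of molecule A and b of molecule B."""
    return sum(pair_energy(qa, ma, ra, qb, mb, rb)
               for (qa, ma, ra) in mol_a
               for (qb, mb, rb) in mol_b)

# Illustrative input: two unit point charges of opposite sign, 2 bohr apart.
zero = (0.0, 0.0, 0.0)
cation = [(1.0, zero, (0.0, 0.0, 0.0))]
anion = [(-1.0, zero, (0.0, 0.0, 2.0))]
print(multipole_energy(cation, anion))
```

Extending the sketch to higher ranks would require the corresponding higher-rank tensors and moments; the double loop over atom pairs stays the same.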

The complete first-order electrostatic term \( E_{EL}^{(10)}\) is calculated in the dimer basis set as the first-order perturbational correction within the polarization approximation [17] and is equivalent to the analogous term defined within Symmetry Adapted Perturbation Theory (SAPT) [15].

$$\begin{array}{@{}rcl@{}} E_{EL}^{(10)} &=& < {\Psi}_{A} {\Psi}_{B} \mid \hat{H}_{AB} - \hat{H}_{A} - \hat{H}_{B} \mid {\Psi}_{A} {\Psi}_{B} > / < {\Psi}_{A} {\Psi}_{B} \mid {\Psi}_{A} {\Psi}_{B} >\\ &=&{\sum\limits_{a}^{A}} {\sum\limits_{b}^{B}} Z_{a} Z_{b} R_{ab}^{-1} + \sum\limits_{r}^{AB} \sum\limits_{s}^{AB} \sum\limits_{t}^{AB} \sum\limits_{u}^{AB} D_{rs}^{A}{(D)} D_{tu}^{B} {(D)} <rs \mid tu>\\ &+& \sum\limits_{r}^{AB} \sum\limits_{s}^{AB} {\sum\limits_{b}^{B}} D_{rs}^{A}{(D)} < r \mid Z_{b} R_{1b}^{-1} \mid s > \\ &+& \sum\limits_{t}^{AB} \sum\limits_{u}^{AB} {\sum\limits_{a}^{A}} D_{tu}^{B}{(D)} < t \mid Z_{a} R_{1a}^{-1} \mid u >\end{array} $$

where the monomer electron densities \( D_{rs}^{A}{(D)} \) and \( D_{tu}^{B}{(D)} \) have been obtained in the dimer basis set D = A + B, \(Z_{a}\) and \(Z_{b}\) denote nuclear charges, and \(< rs \mid tu >\) and \(< r \mid Z_{b} R_{1b}^{-1} \mid s >\) are the two-electron repulsion and one-electron nuclear attraction integrals, respectively.

The electrostatic penetration term \( E_{EL,PEN}^{(10)}\) is defined as the difference between the entire electrostatic energy \( E_{EL}^{(10)}\) and its multipole component \( E_{EL,MTP}^{(10)}\):

$$E_{EL,PEN}^{(10)} = E_{EL}^{(10)} - E_{EL,MTP}^{(10)} $$

Taking mutually orthogonalized monomer wavefunctions obtained in the dimer basis set as a starting point, the first-order Heitler–London interaction energy \(E^{(10)}\) is calculated as the difference between the AB dimer energy \(E_{AB}(D)\) at iteration zero and the monomer energies \(E_{A}(D)\) and \(E_{B}(D)\) obtained in the dimer basis set:

$$E^{(10)} = E_{AB}(D) - E_{A}(D) - E_{B}(D) $$

Neglecting the small Murrell delta term, the exchange repulsion component \(E_{EX}^{(10)}\) is then defined as the difference between the Heitler–London \(E^{(10)}\) term defined above and the electrostatic component \(E_{EL}^{(10)}\):

$$E_{EX}^{(10)} = E^{(10)} - E_{EL}^{(10)} $$

Another component, \( E_{DEL}^{(R0)} \), called the delocalization term, covers higher-order (R) induction and exchange-deformation interactions [25] and is obtained as the difference between the converged SCF interaction energy \({\Delta} E^{SCF}\) and the Heitler–London term \(E^{(10)}\):

$$E_{DEL}^{(R0)} = {\Delta} E^{SCF} - E^{(10)} $$

It should be noted that further partitioning of \( E_{DEL}^{(R0)} \) yields strongly basis set-dependent induction and charge transfer terms, whereas their sum, i.e., the delocalization term, does not display such a dependency [26]. The dispersion and exchange-dispersion terms obtained within the SAPT approach (\( E_{DISP}^{(2)} + E_{EX-DISP}^{(2)}\)), as well as the first-order correlation correction \(\epsilon^{(1)}\), can be closely approximated by atom–atom potentials that include damping functions, \(D_{as}\) [27], which here represent the inter- and intramolecular correlation term \(E_{CORR}\).

The high quality of the \(D_{as}\) functions makes it possible to supplement the SCF interaction energy with correlation effects [27], avoiding the well-known deficiencies of DFT or MP2 methods in representing dispersion interactions.

Thus, the total interaction energy used in this study can be expressed as the sum of the following terms:

$${\Delta} E =E_{EL,MTP}^{(10)} + E_{EL,PEN}^{(10)} + E_{EX}^{(10)} + E_{DEL}^{(R0)} + D_{as},$$

where both the long-range \(E_{EL,MTP}^{(10)}\) and \(D_{as}\) terms scale with the square of the number of atoms, \(O(A^{2})\), and will be supplemented by similarly scaling atom–atom potentials derived in this work to approximate the short-range \(E_{EL,PEN}^{(10)}\), \(E_{EX}^{(10)}\) and \( E_{DEL}^{(R0)}\) terms, yielding a complete nonempirical estimate of the major nonbonded interactions applicable in any force field.
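The bookkeeping behind this partitioning is simple arithmetic on a handful of computed energies, as the following Python sketch shows. The `RawEnergies` container, its field names, and the numerical values are hypothetical; only the subtraction scheme follows the definitions given in this section.

```python
from dataclasses import dataclass

@dataclass
class RawEnergies:
    """Quantities assumed to come from separate calculations (hypothetical layout)."""
    e_ab: float      # dimer energy at iteration zero, E_AB(D)
    e_a: float       # monomer A energy in the dimer basis, E_A(D)
    e_b: float       # monomer B energy in the dimer basis, E_B(D)
    e_el: float      # complete first-order electrostatic term
    e_el_mtp: float  # multipole part of the electrostatic term
    de_scf: float    # converged SCF interaction energy
    d_as: float      # damped atom-atom correlation estimate

def decompose(raw):
    """Return the interaction energy components and their sum."""
    e10 = raw.e_ab - raw.e_a - raw.e_b          # Heitler-London energy
    comps = {
        "EL_MTP": raw.e_el_mtp,
        "EL_PEN": raw.e_el - raw.e_el_mtp,      # penetration = full - multipole
        "EX": e10 - raw.e_el,                   # exchange repulsion
        "DEL": raw.de_scf - e10,                # delocalization
        "D_as": raw.d_as,
    }
    comps["total"] = sum(comps.values())
    return comps

# Made-up numbers, for illustration only.
example = RawEnergies(e_ab=-10.0, e_a=-6.0, e_b=-3.99,
                      e_el=-0.015, e_el_mtp=-0.012,
                      de_scf=-0.012, d_as=-0.004)
for name, value in decompose(example).items():
    print(f"{name:7s} {value: .6f}")
```

Because the first four components telescope, their sum always equals the SCF interaction energy, so the total is the SCF value plus the \(D_{as}\) correction; this provides a convenient consistency check on any implementation.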

Results and discussion

Selection of the optimal functional form for the exchange repulsion term

Due to the critical importance of the repulsion term in empirical force fields [9], we extensively tested various possible functions, starting with the most popular ones such as \(\alpha R^{-12}\) and \(\alpha \exp(-\gamma R)\) and ending with \((\alpha + \beta R^{-1} + \delta R + \kappa R^{2} + \omega R^{-3})\exp(-\gamma R)\), as well as simpler intermediate versions. As reference data, we used values of the first-order exchange term obtained for 660 dimers of biomolecular complexes using the aug-cc-pVDZ basis set, generated from the S66 training set [28]. This set includes 23 hydrogen-bonded, 23 dispersion-dominated, and 20 mixed molecular complexes composed of hydrogen, carbon, nitrogen, and oxygen only. Inclusion of sulphur, phosphorus, and halogen complexes involved in sigma-hole bonding is planned for the future. Besides the six shortest original distances defined by the ratio \(R/R_{eq}\), we generated an additional S66x4 set to cover shorter distances critical for repulsive interactions. The parameters α, β, γ, δ, κ, and ω that appear in the longest functional form given above were optimized by nonlinear least-squares fitting to the nonempirical exchange energy, with all weights equal to 1, using Powell's conjugate direction method [29]. The corresponding root mean square errors (RMSE), both total and for distances close to equilibrium structures (eq), are given in Table 1, together with mean unsigned error (MUE) values (Table 2) and mean unsigned relative error (MURE) values (Table 3). The results presented in Tables 1–3 indicate that the \(\alpha R^{-12}\) function fails completely, whereas among all functions considered \((\alpha + \beta R^{-1})\exp(-\gamma R)\) seems to provide an optimal representation of the exchange repulsion interaction around equilibrium distances. The distance dependence of the ratio of the various approximations to the exact reference \( E_{EX}^{(10)}\) values is shown in Fig. 1 for the methanol dimer. Analogous plots for acetate–methanol, methylamine–methanol, and methylammonium–methanol complexes are shown as Figs. 2, 3, and 4. Again, the \(\alpha R^{-12}\) function from the Amber force field or NEAAP seems to considerably underestimate repulsion over the entire distance range. The very poor performance of the Amber repulsive FF term could be due to its direct coupling to the \(R^{-6}\) van der Waals component via an imprecisely defined well depth and equilibrium distance. Since the optimal function \((\alpha + \beta R^{-1})\exp(-\gamma R)\) resembles the conventional force field expression, it could be easily incorporated into existing molecular mechanics or dynamics packages. The computationally costly contribution of three-body interactions, dominated by the induction term, seems to be negligible and has only a small influence on final geometries.
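To make the fitting and error measures concrete, the following Python sketch fits \((\alpha + \beta R^{-1})\exp(-\gamma R)\) to a single distance curve and evaluates RMSE, MUE, and MURE. It is a simplified stand-in, not the procedure used here: instead of Powell's method it scans γ over a grid and solves the linear least-squares problem for α and β at each γ, and it treats one effective pair rather than a sum over atom pairs in 660 dimers. The synthetic data and all function names are assumptions.

```python
import math

def model(r, alpha, beta, gamma):
    """(alpha + beta / R) * exp(-gamma * R)."""
    return (alpha + beta / r) * math.exp(-gamma * r)

def fit_linear(rs, es, gamma):
    """Unweighted least-squares alpha, beta for fixed gamma (2x2 normal equations)."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for r, e in zip(rs, es):
        p1 = math.exp(-gamma * r)   # basis function multiplying alpha
        p2 = p1 / r                 # basis function multiplying beta
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        b1 += p1 * e
        b2 += p2 * e
    det = s11 * s22 - s12 * s12
    return (b1 * s22 - b2 * s12) / det, (b2 * s11 - b1 * s12) / det

def fit(rs, es, gammas):
    """Grid scan over gamma; returns the (alpha, beta, gamma) with the lowest SSE."""
    best = None
    for g in gammas:
        a, b = fit_linear(rs, es, g)
        sse = sum((model(r, a, b, g) - e) ** 2 for r, e in zip(rs, es))
        if best is None or sse < best[0]:
            best = (sse, a, b, g)
    return best[1], best[2], best[3]

def error_metrics(pred, ref):
    """RMSE, MUE (same units as input) and MURE (percent)."""
    n = len(ref)
    rmse = math.sqrt(sum((p - x) ** 2 for p, x in zip(pred, ref)) / n)
    mue = sum(abs(p - x) for p, x in zip(pred, ref)) / n
    mure = 100.0 * sum(abs(p - x) / abs(x) for p, x in zip(pred, ref)) / n
    return rmse, mue, mure

# Synthetic reference curve generated from known parameters (illustration only).
rs = [2.0 + 0.25 * i for i in range(17)]
ref = [model(r, 50.0, -20.0, 1.6) for r in rs]
alpha, beta, gamma = fit(rs, ref, [k / 10 for k in range(10, 26)])
pred = [model(r, alpha, beta, gamma) for r in rs]
print(alpha, beta, gamma, error_metrics(pred, ref))
```

Separating the linear parameters from γ keeps the sketch dependency-free; a general-purpose optimizer would be needed for the multi-parameter, multi-atom-type fits described above.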

Table 1 Root mean square errors (RMSE) for various functional forms approximating the exchange interaction energy \( E_{EX}^{(10)}\) (in [kcal/mol])
Table 2 Mean unsigned error (MUE) for various analytic functions approximating exchange interaction energies \( E_{EX}^{(10)}\) (in [kcal/mol])
Table 3 Mean unsigned relative error (MURE) for various analytic functions approximating exchange interaction energies \( E_{EX}^{(10)}\) (in [%])
Fig. 1 The distance dependence of the ratio of various functions approximating the repulsion term to the exact reference exchange term \( E_{EX}^{(10)}\) for the methanol dimer

Fig. 2 The distance dependence of the ratio of various functions approximating the repulsion term to the exact reference exchange term \( E_{EX}^{(10)}\) for the methanol dimer

Fig. 3 The distance dependence of the ratio of various functions approximating the repulsion term to the exact reference exchange term \( E_{EX}^{(10)}\) for the acetate–methanol complex

Fig. 4 The distance dependence of the ratio of various functions approximating the repulsion term to the exact reference exchange term \( E_{EX}^{(10)}\) for the methylamine–methanol complex

Atom–atom representation of the remaining interaction energy terms

Due to the presumably critical role of the exchange repulsion term in determining equilibrium geometries, the \((\alpha + \beta R^{-1})\exp(-\gamma R)\) function has also been applied to approximate the remaining short-range components, namely delocalization \(E_{DEL}^{(R0)} \) and electrostatic penetration \( E_{EL,PEN}^{(10)}\), in order to compare them on an equal footing. The root mean square errors around equilibrium presented in Table 4 indicate that the delocalization term \( E_{DEL}^{(R0)}\) can also be reasonably represented by \((\alpha + \beta R^{-1})\exp(-\gamma R)\). On the other hand, the electrostatic penetration component \(E_{EL,PEN}^{(10)}\) seems to require a more complex functional form, such as the one recently proposed by Tafipolsky and Engels [30]. Because the contribution of electrostatic penetration effects is relatively small (on average, electrostatic penetration at equilibrium in the S66 test set is 14.8 times smaller than exchange repulsion), we kept its functional form the same as for the exchange and delocalization terms for the sake of simplicity. The corresponding α, β, and γ parameters are given in Tables 5–7. Combining \(E_{DEL}^{(R0)} \) and \( E_{EX}^{(10)}\), or \(E_{DEL}^{(R0)} \), \( E_{EX}^{(10)}\) and \( E_{EL,PEN}^{(10)}\), leads to some RMSE reduction (Table 4) due to error compensation, but we did not resort to this in order to keep a clear meaning for all interaction energy components.
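Evaluating one such short-range component for a pair of molecules amounts to a double sum over atom pairs, sketched below in Python. The parameter table is a placeholder with made-up values, and keying the parameters by unordered element pair is an assumption made for illustration; the actual values and their indexing are those of Tables 5–7.

```python
import math

# Placeholder (alpha, beta, gamma) values in a.u., keyed by unordered element pair.
# These numbers are invented for the example and are NOT the fitted parameters.
EXAMPLE_PARAMS = {
    frozenset(("H", "H")): (5.0, -1.0, 1.8),
    frozenset(("H", "O")): (10.0, -4.0, 1.0),
    frozenset(("O", "O")): (60.0, -20.0, 1.7),
}

def short_range_energy(mol_a, mol_b, params):
    """Sum of (alpha + beta / R) * exp(-gamma * R) over all intermolecular atom pairs.

    Each molecule is a list of (element, (x, y, z)) tuples with coordinates in bohr.
    """
    total = 0.0
    for elem_a, pos_a in mol_a:
        for elem_b, pos_b in mol_b:
            alpha, beta, gamma = params[frozenset((elem_a, elem_b))]
            r = math.dist(pos_a, pos_b)
            total += (alpha + beta / r) * math.exp(-gamma * r)
    return total

# Illustrative geometry: one O-H fragment facing a single oxygen atom.
frag_a = [("O", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 1.8))]
frag_b = [("O", (0.0, 0.0, 5.4))]
print(short_range_energy(frag_a, frag_b, EXAMPLE_PARAMS))
```

Calling the same function with three different parameter tables, one per component, keeps the exchange, delocalization, and penetration contributions separate, in line with the choice not to merge them.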

Table 4 Root mean square errors (RMSE) for short-range interaction energy components (in [kcal/mol]) approximated by the \((\alpha +\beta R^{-1})\exp (-\gamma R)\) function; \(R/R_{eq}\) values given in parentheses
Table 5 α, β and γ NEAAP parameters for exchange term \( E_{EX}^{(10)} \) (a.u.)
Table 6 α, β and γ NEAAP parameters for delocalization component \(E_{DEL}^{(R0)} \) (a.u.)
Table 7 α, β and γ NEAAP parameters for penetration term \(E_{EL,PEN}^{(10)} \) (a.u.)

Testing nonempirical atom–atom potentials for inhibitor–active site complexes

The NEAAP potentials derived for the S66 model biomolecular complexes have been tested on several urokinase–inhibitor complexes [10], for which force field docking resulted in several short contacts, shown in Table 8. Equilibrium distances for active site amino acid–inhibitor contacts obtained using the standard Tripos 5.2 force field and our atom–atom potentials are compared in Fig. 6 alongside MP2 results. Clearly, the application of nonempirical atom–atom potentials results in considerable improvement, yielding a correlation coefficient \(R^{2} = 0.92\) between NEAAP contact distances and MP2 data, in contrast to the force field value of \(R^{2} = 0.64\). This feature can be very useful in improving the quality of structural predictions, which can be of practical importance in drug design and scoring. It is possible that the occurrence of artefact short contacts in simulations based on conventional force fields is underreported in the literature, as locating them requires significant computational effort and incorrect data are overshadowed by other results when statistical averages are calculated (Fig. 6).
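For readers wishing to reproduce this kind of comparison, a minimal Python sketch of the correlation measure follows. It assumes that \(R^{2}\) denotes the squared Pearson correlation coefficient between predicted and reference contact distances, which the text does not state explicitly; the distances in the example are invented and are not the values of Table 8.

```python
def r_squared(x, y):
    """Squared Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Invented contact distances in angstrom, for illustration only.
reference = [2.8, 3.0, 3.3, 3.6, 2.9]
predicted = [2.7, 3.1, 3.2, 3.7, 3.0]
print(round(r_squared(predicted, reference), 3))
```

Note that this measure is insensitive to a constant offset or scale factor between the two sets of distances, so it is best read together with the individual deviations.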

Table 8 Equilibrium distances [Å] between inhibitors and residues at the active site of urokinase determined by force field [10], NEAAP potential functions and ab initio calculations [10]. Values in parentheses indicate deviation from MP2 equilibrium short contacts
Fig. 5 The distance dependence of the ratio of various approximations of the exchange repulsion term to the exact reference \( E_{EX}^{(10)}\) values for the methylammonium–methanol complex

Fig. 6 Equilibrium separations predicted based on nonempirical atom–atom potentials (NEAAP) or the Tripos force field, compared to corresponding ab initio MP2 results for urokinase and its inhibitors

Conclusions

Derivation of universal nonempirical atom–atom potentials (NEAAP) from interaction energy components defined within hybrid variation-perturbation theory (HVPT) opens the possibility for systematic improvements to force field nonbonded terms that are critical for more accurate modeling of molecular materials. This study indicates that the analytical functions \(\alpha R^{-12}\) or \(\alpha \exp(-\beta R)\) commonly used to represent repulsive interactions are not adequate. By using the more appropriate but still relatively simple form \((\alpha + \beta R^{-1})\exp(-\gamma R)\), it is possible to obtain a better description of exchange repulsion over a wide range of distances, especially around equilibrium. The application of the derived NEAAPs resulted in a considerable improvement of the structural characteristics for an enzyme–inhibitor complex that exhibited artefact short contacts when a conventional force field was applied in the past [10].