1 Introduction

The dynamics of molecular systems exhibit a distinct metastable character: molecular systems tend to remain within an almost invariant subset of conformational space for a long time (long in relation to the step size of the numerical integration, which for atomistic simulations is on the order of one or two femtoseconds), while transitions between different almost invariant subsets (i.e. conformational changes) are rarely observed events. This characteristic is due to the rough potential energy landscape inherent to most molecular systems. Basins of low potential energy, grouped around local minima, are separated by high energy barriers, corresponding to conformational changes, or changes from unbound to bound state. This complicates the sampling of conformational space, as molecular dynamics (MD) trajectories tend to generate states from within the basin of one local minimum for a long time, while transitions between different local minima occur only very rarely, or not at all. This effect, often denoted as trapping, can lead to incomplete coverage of conformational space, and thus to insufficient statistics. It is particularly severe with regard to the sampling of transient regions of conformational space, e.g. in the study of ligand-receptor binding processes, as the dynamics of the system will try to avoid the energetically unfavorable (but most interesting) transition states.

While, as of yet, thermostated long-time MD remains the predominant tool in the molecular simulation community, several successful strategies for overcoming (or rather lessening) the sampling problem have been developed, including umbrella sampling [1], essential dynamics [2] and replica exchange [3]. An excellent implementation of various enhanced sampling schemes is available in terms of the PLUMED plug-in [4] that is compatible with various popular MD packages.

In this article, we present an enhanced version of the ZIBgridfree sampling algorithm [5], which is inspired by the umbrella sampling approach. ZIBgridfree uses an adaptive refinement strategy in order to enable efficient and thorough sampling even in transient regions of conformational space. The main feature of ZIBgridfree as presented here is that it combines an efficient importance sampling scheme with a comprehensive and visual framework for conformational analysis w.r.t. both single molecules and binding processes.

In the initial step of the algorithm, conformational space is partitioned into subsets. Each subset is sampled independently toward convergence of the correct local distribution. More precisely, instead of computing only one trajectory for exploring the potential energy landscape, we compute short trajectories which are confined to a subset of the conformation space by restraints. These subsets then are defined by a partition of unity on the conformation space. If convergence fails (e.g. when the sampling keeps on “jumping” between two local minima), a refinement of the partitioning is triggered, followed by additional sampling. In the subsequent step, each local sampling will be weighted such that the overall histogram yields the global Boltzmann distribution, so that the identification of conformations is reduced to a clustering problem based on the eigenstates of the overlap matrix of the partitioning. Finally, conformational weights and inter-conformational transition probabilities can be determined. The extended version of ZIBgridfree presented here broadens the scope of this sampling scheme by combining it with a standard MD software package so as to give access to the most up-to-date molecular force fields and solvent models.

2 Theory and implementation

2.1 Conformation dynamics

As partitioning methods based on meshes or grids suffer from the “curse of dimensionality”, ZIBgridfree implements a meshless, function-based partitioning approach. This is motivated by the concept of conformation dynamics [6, 7], where conformations of a molecular system are defined in terms of soft-characteristic membership functions, rather than classical sets in position space (below denoted as \(\Omega \)). We are interested in a soft partitioning of the position space, i.e. we want to have a set of functions \(\chi _1, \dots ,\chi _{n_C} :\Omega \rightarrow \left[ 0,1\right] \) such that

$$\begin{aligned} \sum _{i=1}^{n_C}{\chi _i(q)=1}, \end{aligned}$$
(1)

holds for all \(q \in \Omega \). One can regard \(\chi _i\) as a probability distribution. We say that a set of position states is distributed according to \(\chi _i\) when, for each collection of conformations \(A\), a fraction \(\int _A \frac{\chi _i(q)}{\tilde{w}_i} \, \rho (q) \, dq\) of the position states in the set lies in a conformation from \(A\), with the corresponding thermodynamic weights

$$\begin{aligned} \tilde{w}_i {:=} \int \limits _\Omega \chi _i(q) \, \rho (q) \, dq. \end{aligned}$$
(2)

This means the position states are distributed according to the partial density function \(\tilde{\rho }_i\):

$$\begin{aligned} \tilde{\rho }_i = \frac{\chi _i(q)\rho (q)}{\tilde{w}_i}. \end{aligned}$$
(3)

Note that for the special case \(\chi _1,\dots ,\chi _{n_C} :\Omega \rightarrow \{ 0,1 \}\) our approach reduces to the well-known Markov State Model [8–10]. In this case \(\tilde{w}_i\) is the probability to be in set \(A_i:= \{q \in \Omega \mid \chi _i(q)=1\}\) and the transition matrix \(T\) for some fixed time step \(\tau \) is defined such that \(T_{ij}\) denotes the probability to move from set \(A_i\) to set \(A_j\) in time \(\tau \). In general, \(\tilde{w}_i\) denotes the probability that the molecule will be found in the conformation represented by \(\chi _i\), and the transition matrix \(T\) for some time step \(\tau \) is given in the following way: if we have a set of position states distributed according to \(\chi _i\), then after a time step \(\tau \) they will be distributed according to \(\sum _{k=1}^{n_C} \chi _k T_{ik}\). One new property of \(T\) is that the entries do not need to be positive. A partition into metastable conformations is given if we find a soft partitioning such that each distribution \(\chi _i\) represents a metastable conformation, i.e. \(T_{ii} \approx 1\) for \(i=1,\dots ,n_C\). In the following we show how one can obtain such a soft partitioning into metastable conformations and conclude with three examples where we have approximated \(\tilde{w}_i\) for each. For one example we have also approximated the transition matrix.

To find \(\chi _1,\dots , \chi _{n_C}\) we start off with a function basis \(\phi _1,\ldots ,\phi _s:\Omega \rightarrow \left[ 0,1\right] \), where the initial number of basis functions \(s\) should be chosen larger than the anticipated number of conformations \(n_C\). The function basis is chosen such that it has the same properties as the membership functions \(\chi _1,\ldots ,\chi _{n_C}\), i.e. partition of unity (cp. Eq. 1). Therefore, each conformation membership function \(\chi _j\) can be constructed from a convex combination of the basis functions \(\phi _i\) [11]:

$$\begin{aligned} \chi _j=\sum _{i=1}^s{\chi _{disc}(i,j)\phi _i},\ j=1,\ldots ,n_C, \end{aligned}$$
(4)

where \(\chi _{disc}\) is a row-stochastic matrix containing the linear combination factors. Analogous to \(\tilde{\rho }_i\) and \(\tilde{w}_i\) in Eqs. 3 and 2, each of the basis functions is associated with a partial density \(\rho _i\) and a thermodynamic weight \(w_i\). In order to calculate a set of points distributed according to \(\phi _i\), one can simulate a trajectory according to the modified potential energy function \(\tilde{U}_i\) as [11]

$$\begin{aligned} \tilde{U}_i(q) = U(q)+\hat{U}_i(q) = U(q)-\frac{1}{\beta }\ln {(\phi _i(q))}. \end{aligned}$$
(5)

This fact will come in handy for calculating the corresponding \(w_i\) and the subsequent cluster analysis which aims at identifying both the correct number of clusters \(n_C\), as well as the matrix \(\chi _{disc}\) of linear combination factors, from which we obtain the set of membership functions \(\chi _j\) by applying Eq. 4.

As a precondition for the partitioning discussed above, a rough scheme of the relevant position space has to be given. This can be delivered in terms of a long-time MD trajectory (possibly using elevated temperature for improved coverage of position space), a targeted MD or pulling trajectory, the output of certain tools for exploring conformational space (e.g. CONCOORD [12] for protein structures) or even by manually preparing a sequence of geometries. From this presampling, a set of nodes \(\lbrace n_1,\ldots ,n_s \rbrace \subset \Omega \) is selected, to each of which a radial basis function \(W_i\) is attached, given by

$$\begin{aligned} W_i(q) = \exp {(-\alpha \,\delta ^2(q,n_i))},\ i=1,\ldots ,s, \end{aligned}$$
(6)

where \(\alpha \) is a shape parameter, and \(\delta ^2\) a distance measure to be specified in the next section. As the basis functions \(W_i\) do not satisfy Equation 1, we construct a partition of unity with basis functions \(\phi _i\) by following Shepard’s approach [13]:

$$\begin{aligned} \phi _i:=\frac{W_i}{\sum \limits _{j=1}^s{W_j}},\ i=1,\ldots ,s. \end{aligned}$$
(7)

The basis functions \(\phi _i\) take on their maximum at the defining node \(n_i\), and decrease exponentially as the distance \(\delta ^2\) of a state \(q\) to \(n_i\) increases. As a consequence, the difference between \(\tilde{U}_i\) (Eq. 5) and \(U\) is minimal within the state \(n_i\), and increases exponentially with the distance to \(n_i\). This ensures thorough sampling in the area belonging to basis function \(\phi _i\), as the sampling process is restrained from wandering off into a lower energy basin. The shape parameter \(\alpha \) is chosen in dependence on the number of nodes \(s\) and the mean node distance \(\theta \), and defines the degree of separation of the meshless discretization. For \(\alpha \rightarrow \infty \), the discretization converges to a Voronoi tessellation, i.e. the soft partitioning degenerates into a hard partitioning without overlaps between the basis functions.
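The construction in Eqs. 5 to 7 can be illustrated with a minimal sketch for a single torsion angle. The node positions, the value of \(\alpha \), and the use of a squared periodic angle difference as the distance \(\delta ^2\) are assumptions made for illustration only; they are not taken from the ZIBgridfree code.

```python
import math


def periodic_sq_dist(a, b):
    """Squared difference of two angles (radians), wrapped to [-pi, pi).

    Assumption: this stands in for the distance measure delta^2.
    """
    d = (a - b + math.pi) % (2.0 * math.pi) - math.pi
    return d * d


def shepard_basis(q, nodes, alpha):
    """Return [phi_1(q), ..., phi_s(q)] following Eqs. 6 and 7."""
    radial = [math.exp(-alpha * periodic_sq_dist(q, n)) for n in nodes]
    total = sum(radial)
    return [w_i / total for w_i in radial]


def penalty(phi_i, beta):
    """Penalty term -(1/beta) * ln(phi_i(q)) added to U(q) in Eq. 5."""
    return -math.log(phi_i) / beta


# Hypothetical nodes at -180, -60 and 60 degrees.
nodes = [math.radians(a) for a in (-180.0, -60.0, 60.0)]
beta = 1.0 / (0.0083144626 * 300.0)  # 1/(k_B T) in mol/kJ at 300 K

q = math.radians(-68.0)
phi = shepard_basis(q, nodes, alpha=5.0)

# The memberships sum to one; the state is dominated by the node at -60 degrees.
print(sum(phi), phi.index(max(phi)))
# The penalty is small near the owning node and large for distant nodes.
print(penalty(phi[1], beta), penalty(phi[0], beta))
```

Because the penalty is the negative logarithm of the membership, it vanishes where \(\phi _i = 1\) and grows as the state moves into the territory of other nodes.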

In practice, the sampling of the basis functions \(\phi _i\) is run in parallel, as each \(\tilde{U}_i\) can be evaluated at every position \(q \in \Omega \) independently of all \(\tilde{U}_j\) with \(j \ne i\). Depending on the available resources, one can either sample several basis functions in parallel, evaluate the potential \(\tilde{U}_i\) in parallel (which in turn accelerates the sampling of the associated basis function), or combine both approaches.

2.2 Internal coordinates

ZIBgridfree uses internal coordinates (either torsion angles and/or distances) as collective variables in order to define the conformation of the system under observation. Prior to picking a set of nodes for discretization, a set of \(n_K\) internal coordinates has to be specified by the user. The distance \(\delta ^2(q,n_i)\) between state \(q\) and node \(n_i\) (Eq. 6) is measured in the space of internal coordinates. Therefore, the outcome of the discretization is directly related to the choice of internal coordinates. Deciding on a meaningful set of internal coordinates is not always trivial. For conformational analysis of small molecules, picking all rotatable torsion angles is an obvious choice, whereas for peptides or proteins, picking only backbone torsion angles is practical. For complexes of multiple molecules, the set of torsion angles has to be complemented by a set of distances in order to describe the molecules’ relative positioning to each other.

Whereas angular internal coordinates can only take on values between \(-\pi \) and +\(\pi \), distance (or linear) coordinates can in principle take on any positive value. This leads to problems whenever linear coordinates with a large spread or a large absolute value are overly dominant, as other internal coordinates with more subtle changes are rendered irrelevant when the distance function \(\delta ^2\) is evaluated. In order to tackle this problem, linear coordinates can be weighted and normalized automatically by calling zgf_create_pool with the option ‘--balance-linears’.

Let \(k\) be a linear coordinate that corresponds to the Euclidean distance between two particles in the system under observation. The weight of this coordinate is then determined as follows:

$$\begin{aligned} \mathtt{coord\_weight}(k) = \frac{ \mathtt{coord\_weight}(k)_{\textit{initial}} }{ \sqrt{2*\textit{var}(k)}}, \end{aligned}$$
(8)

where \( \mathtt{coord\_weight}(k)_{initial}\) is one, unless specified differently by the user. This means that coordinates with a high spread are downgraded by dividing the initial weight by the full width at half maximum. Furthermore, an offset for \(k\) is applied by subtracting its mean value in order to compensate for high absolute values. This leads to the following weighting formula:

$$\begin{aligned} k_{\textit{balanced}}&= \mathtt{coord\_weight}(k) \cdot (k - \mathtt{offset}(k))\nonumber \\&= \mathtt{coord\_weight}(k) \cdot (k - (\mathtt{offset}(k)_{\textit{initial}} + \textit{mean}(k))),\quad \quad \end{aligned}$$
(9)

where \(\mathtt{offset}(k)_{initial}\) is zero, unless specified differently by the user. This approach realizes an equal weighting of all internal coordinates involved. Nonetheless, certain applications might call for biased weighting of the internal coordinates, e.g. when the distance between ligand and receptor (defined by linear internal coordinates) is to be stressed in comparison to more subtle conformational changes in the ligand molecule (defined by torsion angle internal coordinates).
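As a worked illustration of Eqs. 8 and 9, the following sketch balances one linear coordinate from its presampled values. The function name, the sample distances, and the choice of the population variance for \(\textit{var}(k)\) are assumptions; the text does not state which variance estimator is used.

```python
import math
import statistics


def balance_linear(values, initial_weight=1.0, initial_offset=0.0):
    """Return (balanced values, weight, offset) for one linear coordinate."""
    var = statistics.pvariance(values)                  # assumed estimator
    weight = initial_weight / math.sqrt(2.0 * var)      # Eq. 8
    offset = initial_offset + statistics.mean(values)   # offset used in Eq. 9
    balanced = [weight * (k - offset) for k in values]  # Eq. 9
    return balanced, weight, offset


# Hypothetical ligand-receptor distances in nm.
distances = [1.0, 2.0, 3.0, 4.0, 5.0]
balanced, weight, offset = balance_linear(distances)
print(weight, offset, balanced)
```

For these values the variance is 2, so the weight is 0.5 and the offset is 3; the balanced coordinate is centred on zero with a spread comparable to that of an angular coordinate.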

2.3 Implementing the potential modification

Sampling the ZIBgridfree basis function \(\phi _i\) requires a modification of the potential function \(U(q)\) (Eq. 5). Our aim was to change the algorithm such that it can be run with standard force fields and unmodified molecular dynamics (MD) packages such as GROMACS [14]. Treating the MD code as a black box has several advantages: The user can use readily available software (pre-compiled for many Linux distributions and pre-installed on most computing clusters), and plug in new versions as they are released. Full flexibility regarding the choice of force field and other simulation parameters is retained. Furthermore, internal changes to the highly optimized MD code, possibly having a negative impact on the simulation performance, are avoided.

Adapting ZIBgridfree to a standard MD package is a two-step procedure. First, for each selected node \(n_i\), the \(n_K\)-dimensional \(\phi _i\) function is projected on a single dimension by coordinate-wise evaluation: Instead of considering the joint distance \(\delta ^2(q,n_i)\) (involving all internal coordinates) we now exclusively consider the distance regarding coordinate \(k\):

$$\begin{aligned} \phi _{i_k}(q):=\frac{ \exp {(-\alpha \,\delta _k^2(q,n_i))} }{\sum _{j=1}^s{ \exp {(-\alpha \,\delta _k^2(q,n_j))} }},\ k=1,\ldots ,n_K. \end{aligned}$$
(10)

The above expression yields the membership of state \(q\) with respect to coordinate \(k\) regarding basis function \(\phi _i\). The one-dimensional penalty potential acting on coordinate \(k\) of state \(q\) can simply be obtained as:

$$\begin{aligned} \hat{U}_{i_k}(q)=-\frac{1}{\beta }\ln {(\phi _{i_k}(q))}. \end{aligned}$$
(11)

Finally, in order to approximate \(\hat{U}_i\), for every internal coordinate \(k\), a generic harmonic restraint potential (as available in many common MD packages) is fitted to the penalty potential \(\hat{U}_{i_k}\) and added to the force field representing the unmodified potential \(U\). We implemented this approach for the GROMACS MD package, where restraint potentials of the form

$$\begin{aligned} U_{res}(\Phi ')={\left\{ \begin{array}{ll} \frac{1}{2}k_{res} \left( \Phi '-\Delta \Phi \right) ^2, &{} \text {for }\Phi ' > \Delta \Phi \\ 0, &{} \text {for }\Phi ' \le \Delta \Phi \end{array}\right. } \end{aligned}$$
(12)

are readily available (given here for a torsion angle restraint on torsion angle \(\Phi '=(\Phi _0-\Phi ) \mod 2\pi \), with rest position \(\Phi _0\) and unrestrained region \(\Delta \Phi \), analogous for distance restraints). The concept of fitting restraint potentials to the coordinate-wise projected basis function penalty potentials of ZIBgridfree is depicted in Fig. 1.

Fig. 1

Sampling of a torsion angle distribution (gray histogram) with ZIBgridfree. The sampling is forced to stay within the area of an exemplary basis function (dashed gray line) by its penalty potential (dashed black line). For use with GROMACS, the penalty potential is approximated by a harmonic restraint potential (solid black line). Due to the approximation error, the sampling is not sufficiently limited to the area covered by its basis function (left). After reweighting the sampling points with regard to their basis function (right), the approximation error is removed

The imperfect approximation of multi-dimensional basis functions by harmonic restraints introduces a certain error, as sampling points may be generated from areas of \(\Omega \) that are not covered by the basis function in question. This is especially true for boundary regions, where several basis functions are overlapping. This approximation error can be removed by giving each sampling point \(q\) a weight \(\mathtt{frame\_weight}_i(q)\) with respect to basis function \(\phi _i\):

$$\begin{aligned} \mathtt{frame\_weight}_i(q)=\frac{\phi _i(q)}{\exp (-\beta \cdot U_{res}(q))}. \end{aligned}$$
(13)

The effect of reweighting on the sampling distribution is depicted in Fig. 1. Calculating the sampling point weights is inexpensive in terms of computation time. Subsequently, when checking for convergence of the sampling, or when calculating observables of any kind, only the reweighted distribution is considered.
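The interplay of Eqs. 12 and 13 can be sketched as follows for a single torsion angle. The function names and numerical values are hypothetical, and the restraint is written symmetrically, assuming that the deviation from the rest position is wrapped to \([-\pi , \pi )\) and taken as an absolute value; this is one reading of Eq. 12, not a statement about the GROMACS implementation.

```python
import math


def restraint_energy(angle, rest, flat, k_res):
    """Flat-bottomed harmonic restraint in the spirit of Eq. 12 (radians).

    Assumption: the deviation is wrapped to [-pi, pi) and used as an
    absolute value, so both sides of the rest position are treated alike.
    """
    dev = abs((rest - angle + math.pi) % (2.0 * math.pi) - math.pi)
    if dev <= flat:
        return 0.0
    return 0.5 * k_res * (dev - flat) ** 2


def frame_weight(phi_i, u_res, beta):
    """Eq. 13: phi_i(q) / exp(-beta * U_res(q))."""
    return phi_i / math.exp(-beta * u_res)


beta = 1.0 / (0.0083144626 * 300.0)  # 1/(k_B T) in mol/kJ at 300 K

# Inside the flat region the restraint vanishes and the weight equals phi_i.
u_in = restraint_energy(0.1, rest=0.0, flat=0.2, k_res=100.0)
# Outside the flat region the restraint energy is divided out again.
u_out = restraint_energy(0.5, rest=0.0, flat=0.2, k_res=100.0)
print(u_in, frame_weight(0.8, u_in, beta))
print(u_out, frame_weight(0.8, u_out, beta))
```

Dividing by the Boltzmann factor of the restraint removes the bias that was actually applied during the simulation, and multiplying by \(\phi _i\) imposes the bias that was intended.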

2.4 Adaptive refinement of the partitioning

In order to ensure a sufficient sampling of the partial densities \(\rho _i\), ZIBgridfree pursues an adaptive refinement approach. After a certain number of simulation steps, convergence of the sampling is tested by evaluating the variance-based Gelman-Rubin convergence criterion [15]. If the convergence test fails, the sampling will be extended by \(n\) simulation steps (followed by another convergence test) for a maximum of \(m\) times (where \(n\) and \(m\) are user-defined settings). If convergence has not been achieved after \(m\) extensions of the original sampling length, a refinement of the partitioning in the area of the affected basis function is triggered. By default, two child nodes \(n_{i_1}\) and \(n_{i_2}\) are introduced, whereas the original parent \(n_i\) is removed from the partitioning, along with its basis function \(\phi _i\). This principle is illustrated in Fig. 2.
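For orientation, the textbook form of the Gelman-Rubin statistic can be sketched for one internal coordinate as follows. How ZIBgridfree forms the chains (for example by splitting a node trajectory into segments) and which threshold it applies are not stated here; the threshold of 1.1 below is a common rule of thumb and an assumption.

```python
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor for chains of equal length."""
    n = len(chains[0])
    means = [statistics.mean(c) for c in chains]
    # Mean of the within-chain sample variances.
    within = statistics.mean([statistics.variance(c) for c in chains])
    # Between-chain variance, scaled by the chain length.
    between = n * statistics.variance(means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Two segments exploring the same minimum: the statistic stays below 1.1.
mixed = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
# Two segments stuck in different minima: the statistic is far above 1.1.
trapped = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]

THRESHOLD = 1.1  # assumed convergence threshold
print(gelman_rubin(mixed) < THRESHOLD, gelman_rubin(trapped) < THRESHOLD)
```

A sampling that jumps between two minima produces segments with very different means, which inflates the between-chain variance and therefore the statistic.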

Fig. 2

The sampling of basis function ‘1’ (associated with a node at \(-68^{\circ }\)) has come upon a second minimum in the region around \(-180^{\circ }\) (left). In this case, convergence of the sampling is not achieved in the allocated number of sampling steps. A failed convergence test triggers an automatic refinement of the partitioning (right). The parent node ‘1’ is removed and replaced by two children named ‘7’ (\(-65^\circ \)) and ‘8’ (\(-167^\circ \)). The samplings of the associated basis functions converge quickly, as they are now confined to a single energy minimum each

Removal and addition of nodes have an impact on the overall partitioning, as with the number of nodes \(s\), the mean node distance \(\theta \) is bound to change. Hence, the shape parameter \(\alpha \) (Eq. 6) is recalculated following each refinement step. With proceeding refinement and increasing \(s\), \(\alpha \) will become larger, which in turn leads to a higher degree of separation between basis functions. This mechanism leads to increased convergence rates over the course of the refinement.

Despite several cycles of refinement, the sampling of transition regions (e.g. when a node is situated on the steep flank of a potential energy barrier) may not lead to convergence according to the Gelman-Rubin criterion. In these cases, the sampling has to be discontinued as soon as a sufficient number of data points from the transition region has been collected.

2.5 Reweighting and cluster analysis

2.5.1 Direct free energy reweighting

The local confined samplings are distributed according to

$$\begin{aligned} \rho _i =\frac{\phi _i \cdot \rho }{w_i}. \end{aligned}$$

If we can calculate the terms \(w_1,\dots ,w_s\) we can approximate the correct Boltzmann distribution by weighting the local histogram of \(\rho _i\). The correct weighting is given through

$$\begin{aligned} \sum _{i=1}^s w_i \rho _i = \sum _{i=1}^s \phi _i \rho = \rho \end{aligned}$$

since the \(\phi _i\)’s sum up to one. The partition of unity assures that the passage between the overlapping subsets is described correctly. We remark that this partition of the conformation space is for the purpose of efficiency only and has thus no real physical or chemical meaning. In order to get the “true” global distribution we thus have to account for these local restraints, since otherwise spurious effects might occur, as illustrated in Fig. 3 for the torsion angle distribution of \(n\)-pentane. In order to arrive at a balanced joint Boltzmann distribution, we need to find the correct \(w_i\). This is done with the free energy difference estimate implemented in the tool zgf_reweight, based on the approach of Klimm et al. [16]. This approach, which is not dependent on explicit overlap between the partial densities, is outlined briefly in the following. In principle, other methods for thermodynamic reweighting, such as the popular weighted histogram analysis method (WHAM) [17, 18], can be employed as well.

Fig. 3

Torsion angle distribution of the two torsion angles of \(n\)-pentane at 300 K, assembled from 25 individual node samplings. Before reweighting, each partial density contributes equally to the joint distribution (left). This leads to disproportionately high weights of the gauche/trans, trans/gauche and gauche/gauche conformations. After thermodynamic reweighting, the correct relative weights of the partial densities are restored, which leads to an improved joint distribution (right)

  1.

    From each set of states \(\lbrace q_n^{(i)} \rbrace _{n=1,\ldots ,N^{(i)}} \subset \Omega \) representing the partial density \(\rho _i, i=1,\ldots ,s\), choose a set of reference points \(\lbrace q_r^{(i)} \rbrace _{r=1,\ldots ,R^{(i)}}\). A reference point is characterized by having a potential energy value within the energy standard deviation of \(\rho _i\). More precisely, with \(\langle U^{(i)} \rangle \) being the mean potential energy of set \(q^{(i)}\),

    $$\begin{aligned} \left| U(q_r^{(i)}) - \langle U^{(i)} \rangle \right| \le \sqrt{ \frac{ 1 }{ N^{(i)} } \sum _{n=1}^{N^{(i)}} \left( U(q_n^{(i)}) - \langle U^{(i)} \rangle \right) ^2 }. \end{aligned}$$
  2.

    Approximate the local density of sampling points by evaluating expression \(D_{vol_i}\), which counts the number \(N_{near}^{(i)}\) of sampling points that are near, i.e. within a certain distance \(vol_i\) around each reference point \(q_r^{(i)}\), and compute its inverse

    $$\begin{aligned} \left( D_{vol_i}( q_r^{(i)} ) \right) ^{-1} \approx \frac{ N^{(i)}}{ N_{near}^{(i)} + 1 }. \end{aligned}$$

    For our purpose, \(vol_i\) is chosen as large as the mean variance of the internal coordinates regarding all sets of states \(q^{(i)}\), which is precomputed in a first iteration over the sampling data. The variance for each set is computed in terms of the distance function \(\delta ^2\), dependent on the type of the internal coordinates that are involved in the discretization.

  3.

    Compute the entropy estimate

    $$\begin{aligned} S_i = k_B \ln \left( \frac{ 1 }{ R^{(i)} } \sum _{r=1}^{R^{(i)}} \left( D_{vol_i}( q_r^{(i)} ) \right) ^{-1} \right) , \end{aligned}$$

    the free energy

    $$\begin{aligned} G_i = \langle U^{(i)} \rangle - T \cdot S_i, \end{aligned}$$

    and the statistical weights

    $$\begin{aligned} w_{i} = w_{i-1} \cdot \exp \left( -\beta \left( G_{i} -G_{i-1} \right) \right) , \end{aligned}$$

    with \(w_1=1\). The free energy values have to be ordered by size before calculating the statistical weights. Finally, the statistical weights have to be normalized so that \(\sum _{i=1}^{s} w_i = 1\).
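The third step of the procedure above can be sketched as follows, assuming that the inverse local densities of the reference points and the mean potential energies have already been computed. The function names and the unit convention (kJ/mol, with \(k_B\) expressed per mole) are assumptions for illustration and do not describe the interface of zgf_reweight.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def entropy_estimate(inverse_densities, k_b=K_B):
    """Entropy from the mean inverse local density of the reference points."""
    return k_b * math.log(sum(inverse_densities) / len(inverse_densities))


def free_energy(mean_energy, entropy, temperature):
    """G_i = <U^(i)> - T * S_i."""
    return mean_energy - temperature * entropy


def statistical_weights(free_energies, temperature, k_b=K_B):
    """Chain the weights along the free energies sorted by size, then normalize."""
    beta = 1.0 / (k_b * temperature)
    order = sorted(range(len(free_energies)), key=lambda i: free_energies[i])
    weights = [0.0] * len(free_energies)
    previous = None
    for i in order:
        if previous is None:
            weights[i] = 1.0  # the lowest free energy starts the chain
        else:
            delta = free_energies[i] - free_energies[previous]
            weights[i] = weights[previous] * math.exp(-beta * delta)
        previous = i
    total = sum(weights)
    return [w / total for w in weights]


# Two hypothetical nodes whose free energies differ by k_B * T * ln(2).
T = 300.0
G = [K_B * T * math.log(2.0), 0.0]
print(statistical_weights(G, T))  # approximately [1/3, 2/3]
```

Since the recursion telescopes, each weight is proportional to \(\exp (-\beta G_i)\); the ordering by size only fixes which node receives the starting value of one before normalization.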

2.5.2 Overlap weight correction

The reweighting method introduced in the previous section works best for well-separated basis functions. Depending on the given discretization and the nature of the system under observation, the basis functions in ZIBgridfree can have a more or less pronounced overlap. We perform a correction of the statistical weights \(w_i\) in order to take basis function overlap into account. The degree of overlap between each pair of basis functions \(\phi _i\) and \(\phi _j\) is quantified in terms of the overlap integral matrix \(S \in \mathbb {R}^{s \times s}\):

$$\begin{aligned} S_{ij} = \int \limits _{\Omega } \phi _j(q) \rho _i(q) \, dq, \end{aligned}$$
(14)

which, for a large number of sampling points \(N^{(i)}\), is approximated as

$$\begin{aligned} S_{ij} = \frac{ 1 }{ N^{(i)} } \sum _{n=1}^{N^{(i)}} \phi _j(q_n^{(i)}) \cdot \mathtt{frame\_weight}_i(q_n^{(i)}) \end{aligned}$$
(15)

from the states \(\lbrace q_n^{(i)} \rbrace _{n=1,\ldots ,N^{(i)}}\) that represent the partial density \(\rho _i\). Note that the shape of \(S\) is influenced by the chosen discretization, in particular by the number of discretization nodes \(s\). For fine discretizations (large \(\alpha \), cp. Eqs. 6 and 7), \(S\) will resemble a diagonal matrix. For very coarse discretizations and small \(\alpha \), it will degenerate into a full matrix.
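A minimal sketch of the estimator in Eq. 15 is given below for a one-dimensional toy system with two nodes. The node positions, the samples, and the unit frame weights are made up for illustration, and a plain squared difference stands in for the distance \(\delta ^2\).

```python
import math


def basis(q, nodes, alpha):
    """Shepard basis functions phi_j(q) for a 1D toy coordinate."""
    radial = [math.exp(-alpha * (q - n) ** 2) for n in nodes]
    total = sum(radial)
    return [r / total for r in radial]


def overlap_matrix(samples, frame_weights, nodes, alpha):
    """Estimate S following Eq. 15.

    samples[i] holds the states sampled for node i and frame_weights[i]
    the matching frame weights from Eq. 13.
    """
    s = len(nodes)
    S = []
    for i in range(s):
        row = [0.0] * s
        for q, fw in zip(samples[i], frame_weights[i]):
            phi = basis(q, nodes, alpha)
            for j in range(s):
                row[j] += phi[j] * fw
        S.append([value / len(samples[i]) for value in row])
    return S


nodes = [0.0, 1.0]                                # hypothetical nodes
samples = [[-0.1, 0.0, 0.2], [0.9, 1.0, 1.1]]     # hypothetical samplings
frame_weights = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

S = overlap_matrix(samples, frame_weights, nodes, alpha=4.0)
for row in S:
    print(row)
```

With unit frame weights every row sums to one because the basis functions form a partition of unity, and the diagonal entries dominate as long as each sampling stays close to its own node.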

The statistical weights \(w\) of the basis functions can be derived by solving the eigenvalue problem \(w^{\top }S = w^{\top }\), which means that \(w\) corresponds to the unique, positive and normalized left eigenvector of \(S\) with regard to its eigenvalue \(\lambda _1 = 1\) [11]. This eigenvector-based approach is not well-conditioned and highly dependent on sufficient sampling in the overlap regions between the basis functions [19]. In order to benefit from the advantages of both direct free energy reweighting and the eigenvector-based approach, we start a number of power iteration steps with the stochastic matrix \(S\) from the original weights \(w\), until the corrected weights (again denoted as \(w\)) have converged.
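The correction step can be illustrated with a plain left power iteration. The matrix and starting weights below are toy values, and the tolerance and step limit are assumed parameters rather than settings taken from ZIBgridfree.

```python
def power_iteration(S, w0, tol=1e-12, max_steps=10000):
    """Iterate w^T <- w^T S from the start vector w0 until w stops changing."""
    total = sum(w0)
    w = [x / total for x in w0]
    size = len(w)
    for _ in range(max_steps):
        new = [sum(w[i] * S[i][j] for i in range(size)) for j in range(size)]
        norm = sum(new)
        new = [x / norm for x in new]
        if max(abs(a - b) for a, b in zip(new, w)) < tol:
            return new
        w = new
    return w


# Toy row-stochastic overlap matrix and hypothetical starting weights.
S = [[0.9, 0.1],
     [0.2, 0.8]]
w_start = [0.5, 0.5]

w = power_iteration(S, w_start)
print(w)  # approaches [2/3, 1/3], which satisfies w^T S = w^T
```

Starting from the free energy based weights instead of an arbitrary vector means that only a few iterations are needed when the two estimates already agree.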

The row sums of the matrix \(S\) do not correspond to the corrected weights \(w\). According to the method of Sinkhorn [20], an iterative rescaling of the row sums to meet \(w\), followed by a symmetrization of \(S\), leads to a corrected overlap integral matrix that is consistent with the precomputed statistical weights.

2.5.3 Metastability analysis with PCCA+

From the chemical perspective, metastable subsets correspond to the main conformations of the underlying molecular system. In the presence of metastable states, any matrix describing the transition behavior of the system (including the matrix \(S\)) exhibits a nearly block-diagonal structure, i.e. there exists a permutation of indices so that the metastable subsets of the system are represented by (more or less) square blocks along the diagonal of the matrix (see Fig. 4).

Fig. 4

Schematic of a (permuted) transition matrix in the presence of metastable subsets. Within the three conformations \(c_1\) to \(c_3\), states are mixing quickly. By contrast, transitions from conformation to conformation (light gray off-diagonal area) are rare events

Every block in this matrix is associated with an eigenvector of the matrix whose eigenvalue is almost one. The set of the eigenvalues in the vicinity of one is denoted as the Perron cluster, and the size of this set corresponds to the number of chemical conformations \(n_C\). The linear combinations of the eigenvectors associated with the eigenvalues of the Perron cluster contain, for each basis function \(\phi _i\), the degree of membership with regard to each of the \(n_C\) conformations. Robust Perron cluster analysis (PCCA+) [21, 22] is used to find the permutation yielding the block-diagonal structure, and hence the matrix of linear combination factors \(\chi _{disc}\) (cp. Eq. 4). The result is the matrix \(\chi \in \mathbb {R}^{s \times n_C}\), where the entry \(\chi (i,j) \in \left[ 0,1 \right] \) denotes the degree of membership of basis function \(\phi _i\) with regard to the \(j\)th metastable subset.

Using the weight vector \(w\) containing the thermodynamic weights of the basis functions \(\phi _i\), it is then possible to calculate the weights \(\tilde{w}\) of the conformations as \(\tilde{w} = \chi ^{\top }w\).
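As a small worked example of \(\tilde{w} = \chi ^{\top }w\), consider three basis functions and two conformations. The membership matrix and the weights are invented for illustration.

```python
def conformation_weights(chi, w):
    """Return chi^T w for a membership matrix chi (s x n_C) and weights w (s)."""
    n_conf = len(chi[0])
    return [sum(chi[i][j] * w[i] for i in range(len(w))) for j in range(n_conf)]


# Hypothetical memberships: node 2 lies in the transition region.
chi = [[1.0, 0.0],
       [0.7, 0.3],
       [0.0, 1.0]]
w = [0.5, 0.2, 0.3]  # hypothetical basis function weights, summing to one

w_conf = conformation_weights(chi, w)
print(w_conf)  # approximately [0.64, 0.36]
```

Because each row of \(\chi \) sums to one, the conformational weights inherit the normalization of the basis function weights.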

3 Molecular simulation details

All molecular simulations were performed with GROMACS, versions 4.5.4 and 4.5.5 (single precision, unless stated differently). All molecules were parametrized for the Amber-99SB force field [23]. Residues not already included in the standard force field were prepared using the software ACPYPE [24] and Antechamber [25, 26] from AmberTools [27], with charges calculated by the AM1-BCC method [28, 29].

For the vacuum simulations (\(n\)-pentane), van der Waals and Coulomb interactions were computed without cut-off (all vs. all). For the explicit solvent alanine dipeptide simulations, the TIP4P-Ew water model [30, 31] was used. The solute was placed in a rhombic dodecahedron periodic box of 4.0 nm side length. The host-guest system structure in non-complexed form (with the guest molecule displaced by 1.5 nm) was placed in a cubic periodic box of 6.5 nm side length and solvated in a 10:1 mixture of chloroform and methanol. The force field parameters for chloroform and methanol were obtained from the GROMACS Molecule & Liquid Database at URL http://virtualchemistry.org/gmld.php [32, 33]. To neutralize the overall charge, a single counter ion was added to the simulation box. In both cases, a twin range cut-off of 1.0/1.4 nm for van der Waals interactions was applied and the smooth particle mesh Ewald algorithm [34] was used for Coulomb interactions, with a switching distance of 1.0 nm.

In order to generate the \(NVT\) ensemble of states for the desired temperature of 298/300 K, either the velocity-rescaling thermostat [35] in combination with an MD leap-frog integrator, or a Langevin-type stochastic dynamics [36] integrator was used. For the explicit solvent \(NpT\) simulations (alanine dipeptide), the velocity-rescaling thermostat/stochastic dynamics integrator was supplemented by the Parrinello-Rahman barostat [37, 38], with a reference pressure of 1 bar. For the host-guest system transition node samplings, neither thermostat nor barostat was applied in order to realize an \(NVE\) ensemble setup. The integration step was set to 1 fs for all simulations. The error threshold for the symmetrization of the \(S\) matrices was set to \(10^{-2}\) for \(n\)-pentane, to \(10^{-4}\) for alanine dipeptide, and to \(10^{-3}\) for the host-guest system.

4 Results and discussion

4.1 Pentane in vacuo

In order to evaluate basic properties of the algorithm, vacuum simulations of \(n\)-pentane, a small alkane with five carbon atoms (see Fig. 5), were conducted. The two backbone torsion angles of \(n\)-pentane were chosen as internal coordinates for the discretization. With regard to these internal coordinates, \(n\)-pentane has nine main conformations, separated by distinct energy barriers. The presampling of conformational space was obtained in terms of a 100 ns MD simulation at a very high (and physically unrealistic) temperature of 1,000 K. Reference weights for the conformations of \(n\)-pentane were taken from the literature [39] (see Table 1).

Table 1 Conformational weights of \(n\)-pentane at 300 K, derived from a hybrid Monte Carlo (HMC) simulation using the Merck molecular force field [40]

Fig. 5

Three-dimensional representation of \(n\)-pentane. The two backbone torsion angles chosen as internal coordinates are highlighted

4.1.1 Stability regarding randomness of impulse and discretization

In order to monitor the impact of choosing a different discretization (placing of nodes in conformational space) on the sampling outcome, three experiments with ten runs of ZIBgridfree each were conducted: a) Equally placed nodes, but random MD starting impulse, b) randomly placed nodes, but equal MD starting impulse, and c) randomly placed nodes and random MD starting impulse. All runs were conducted with 20 discretization nodes and a minimum sampling time of 100 ps per node, leading to a mean overall sampling time per run of 2.8 ns. The results are shown in Fig. 6, left.
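The per-conformation statistics reported for these experiments (mean weight, standard deviation over ten runs, deviation from the literature value) can be sketched as follows. The data below are made up for illustration, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def summarize_runs(weights, reference=None):
    """Mean, standard deviation and deviation from reference per conformation.

    weights:   array of shape (n_runs, n_conformations), each row sums to 1
    reference: optional array of shape (n_conformations,)
    """
    weights = np.asarray(weights, dtype=float)
    mean = weights.mean(axis=0)
    std = weights.std(axis=0, ddof=1)  # assumption: sample standard deviation
    deviation = None if reference is None else mean - np.asarray(reference, dtype=float)
    return mean, std, deviation

# Hypothetical toy data: 3 runs, 3 conformations (not taken from the paper)
toy_runs = [[0.5, 0.3, 0.2],
            [0.6, 0.2, 0.2],
            [0.4, 0.4, 0.2]]
toy_reference = [0.45, 0.35, 0.20]
mean, std, deviation = summarize_runs(toy_runs, toy_reference)
```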

Fig. 6 Conformational weights of \(n\)-pentane. Error bars indicate the standard deviation w.r.t. 10 runs. Deviation from the literature values is indicated as intra-bar plot. Left 20 nodes, 100 ps minimum sampling time per node, with equally placed nodes, random MD starting impulse (dark gray), randomly placed nodes, equal MD starting impulse (gray), and randomly placed nodes and random MD starting impulse (light gray). Right 100 ps minimum sampling time per node, 10, 20, 30 and 40 nodes (dark gray to light gray), random MD starting impulse, random node placement

Randomizing the MD starting impulse leads to a maximum standard deviation of 0.025 regarding the weight of the most dominant conformation, tr/tr. Randomizing the node placement by picking different initial seeds for the \(k\)-means algorithm leads to a maximum standard deviation of 0.031 for conformation tr/tr. When both MD starting impulse and node placement are randomized at the same time (mimicking a standard sampling setup), the maximum standard deviation is slightly smaller (0.023 for conformation tr/tr), which indicates that the uncertainty regarding both choices is not additive.
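To put the non-additivity statement into numbers, one can compare the combined standard deviation with two simple reference values built from the single-source results: the plain sum, and the root-sum-of-squares that would be expected if the two sources were independent. This is only an illustrative back-of-the-envelope check, not an analysis performed in the study.

```python
import math

sd_impulse = 0.025  # tr/tr, random MD starting impulse only (from the text)
sd_nodes = 0.031    # tr/tr, random node placement only (from the text)

# Reference value 1: uncertainties simply add up
sd_linear = sd_impulse + sd_nodes

# Reference value 2: independent sources, variances add up
sd_quadrature = math.hypot(sd_impulse, sd_nodes)

print(f"linear sum:          {sd_linear:.4f}")
print(f"root-sum-of-squares: {sd_quadrature:.4f}")
```

The two reference values are 0.056 and roughly 0.040. A combined standard deviation that is smaller than each single-source value lies below both of them.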

4.1.2 Stability regarding fineness of discretization

Similar simulations (random MD starting impulse, random node placement, 100 ps minimum sampling time per node) were performed with varying number of sampling nodes in order to evaluate the impact of the fineness of the discretization. For this experiment, automatic refinement of the discretization was switched off. The results are shown in Fig. 6, right. When only ten discretization nodes are used (only one more than the expected number of conformations), the error becomes very large (0.128 for conformation tr/tr), and, despite a relatively large mean overall sampling time of 3.2 ns per run, the rare conformations g+/g- and g-/g+ are not identified at all. For 20, 30 and 40 discretization nodes (mean overall sampling times 2.79, 4.45 and 5.5 ns per run), the results are comparable, but do not improve visibly with increasing fineness of the discretization.

4.1.3 Stability regarding sampling time

Finally, the influence of the sampling time per node on the quality of the results was examined. The outcome is shown in Fig. 7. A very short minimum sampling time of 10 ps per node produces a large error (0.099 for conformation tr/tr), but, given the mean overall sampling time of only 365 ps per run, the averaged conformational weights are acceptable. With increasing sampling time per node, the error can be significantly reduced. For a minimum sampling time of 1,000 ps per node (mean overall sampling time 26.7 ns), the maximum standard deviation (conformation tr/tr) is reduced to 0.016, and below one percent for all other conformations. One can conclude that a rough estimate of the conformational weights can be obtained at a very low cost, whereas precise results have to be paid for with thorough sampling of the partial densities.

Fig. 7 Conformational weights of \(n\)-pentane. Error bars indicate the standard deviation w.r.t. 10 runs. Deviation from the literature values is indicated as sub-bar plot. 25 nodes, with 10, 100 and 1,000 ps minimum sampling time per node (dark gray to light gray), random MD starting impulse, random node placement

The results show a noticeable deviation from the conformational weights found in the literature (cp. Table 1), which can most likely be attributed to the use of a different force field and (possibly) the different dynamics used for propagating the system. For comparison, the conformational weights obtained from ZIBgridfree with 25 nodes and 1,000 ps minimum sampling time per node, averaged over ten runs, are given in Table 2.

Table 2 Averaged conformational weights of \(n\)-pentane at 300 K, derived from ten runs of ZIBgridfree using the Amber-99SB force field

4.2 Alanine dipeptide in water

As a second example, the conformations of alanine dipeptide in explicit TIP4P-Ew water were studied. Alanine dipeptide is the most basic (or “minimal”) polypeptide and serves as a popular test case for evaluating biological force fields. The two backbone torsion angles \(\Phi \) and \(\Psi \) span the relevant conformational space of alanine dipeptide, and were hence chosen as internal coordinates for the discretization. With regard to these internal coordinates, alanine dipeptide has six main conformations, which however are not as well-separated as in the previous example, \(n\)-pentane. Obtaining correct conformational weights from explicit solvent simulations is more difficult compared to vacuum or implicit solvent settings, as the dynamics of a solvated system is decelerated, while the computational cost of producing sufficient sampling data multiplies (Fig. 8).

Fig. 8 Three-dimensional representation of alanine dipeptide (ACE-ALA-NME, i.e. terminally blocked alanine). The two backbone torsion angles \(\Phi \) and \(\Psi \) chosen as internal coordinates are highlighted

Reference weights for the conformations of alanine dipeptide at 300 K in the \(NVT\) and in the \(NpT\) ensemble were obtained from two 200 ns MD simulations (see Table 3).
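The text does not spell out how the reference weights were extracted from the long trajectories. One straightforward possibility, sketched below under that assumption, is to assign every frame to a conformation and take the fraction of frames per conformation. The assignment step itself (for example by regions in the \(\Phi \)/\(\Psi \) plane) is outside the sketch, and the label array is invented for illustration.

```python
import numpy as np

def weights_from_labels(labels, n_conformations):
    """Fraction of trajectory frames assigned to each conformation.

    labels: integer conformation index per frame, values in [0, n_conformations)
    """
    labels = np.asarray(labels, dtype=int)
    counts = np.bincount(labels, minlength=n_conformations)
    return counts / counts.sum()

# Hypothetical frame labels (8 frames, 4 possible conformations)
toy_labels = [0, 0, 1, 2, 2, 2, 0, 1]
toy_weights = weights_from_labels(toy_labels, n_conformations=4)
```

A conformation that is never visited simply receives weight zero, which is how trapping of a plain MD trajectory would show up in such an estimate.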

Table 3 Conformational weights of alanine dipeptide at 300 K in the \(NVT\) and in the \(NpT\) ensemble, derived from two 200 ns MD simulations using the Amber-99SB force field

Explicitly modeled water also complicates the presampling of conformational space: High (or elevated) temperature presampling is possible only to a certain extent, and requires a re-equilibration of the simulation boxes before the partial densities can be sampled at the target temperature. In principle, discretization nodes can also be picked from a vacuum or implicit solvent trajectory of the molecule of interest, to be put in explicit solvent only before the sampling of partial densities with ZIBgridfree is commenced (implemented in the tools zgf_solvate_nodes and zgf_genion). Again, another cycle of energy minimization and simulation box equilibration is needed before usable sampling data can be collected. For this example, the presampling consisted of a 100 ns MD trajectory at the target temperature of 300 K, which means that re-equilibration after node selection was not necessary.

4.2.1 Stability regarding sampling time

First, the influence of the sampling time per node on the quality of the results was examined, using random MD starting impulse and random node placement in an \(NVT\) ensemble. The outcome is shown in Fig. 9. In comparison to the (vacuum) \(n\)-pentane example, a longer minimum sampling time per node is required in order to yield acceptable results. For a very short minimum sampling time of 10 ps per node, the results were not interpretable due to the large error (data not shown). A minimum sampling time of 100 ps per node (mean overall sampling time 2.4 ns) produces large errors of around 15 % in terms of standard deviation for the three largest conformations \(P_{II}, C_5\) and \(\alpha _R\). When the minimum sampling time per node is increased to 500 ps (mean overall sampling time 7.7 ns), the error can be reduced below 6 % for all conformations (largest error is 0.0581 for conformation \(P_{II}\)). Finally, with a minimum sampling time of 1,000 ps per node (mean overall sampling time 15 ns), the error is at most about 5 % and mostly lower (largest error is 0.0533 for conformation \(P_{II}\)).

Fig. 9 Conformational weights of alanine dipeptide. Error bars indicate the standard deviation w.r.t. 10 runs. Deviation from the reference values is indicated as sub-bar plot. 15 nodes, with 100, 500 and 1,000 ps minimum sampling time per node (dark gray to light gray), including an auxiliary 1,000 ps double precision trial, random MD starting impulse, random node placement

An auxiliary trial with a minimum sampling time of 1,000 ps per node (mean overall sampling time 15.56 ns) using a double precision version of GROMACS did not lead to a further decrease in standard deviation, contrary to what might have been expected from an increase in precision of coordinates and observables.

4.2.2 Stability regarding choice of dynamics

Second, similar simulations (random MD starting impulse, 15 randomly placed nodes, 1,000 ps minimum sampling time per node) were performed in which the standard MD integrator was replaced by a stochastic dynamics (SD) integrator. Both integrators were compared in the context of an \(NVT\) and an \(NpT\) ensemble, the latter realized by using a Parrinello-Rahman barostat. All trial runs were conducted with a double precision version of GROMACS. The results are shown in Fig. 10. In both \(NVT\) and \(NpT\) ensemble, the SD integrator delivers better results with regard to the standard deviation over ten runs. In the \(NVT\) ensemble, the largest error obtained with the SD integrator is 3.618 % (conformation \(P_{II}\)), compared to 5.86 % when the MD integrator is used (conformation \(\alpha _{R}\)). This gap narrows somewhat in the \(NpT\) ensemble, where the largest error obtained with the SD integrator is 5.35 %, compared to 6.3 % when the MD integrator is used (both w.r.t. conformation \(\alpha _{R}\)).

Fig. 10 Conformational weights of alanine dipeptide. Error bars indicate the standard deviation w.r.t. 10 runs. Deviation from the reference values is indicated as sub-bar plot. 15 nodes, 1,000 ps minimum sampling time per node, random node placement, with MD integrator (\(NVT\)), SD integrator (\(NVT\)), MD integrator (\(NpT\)), and SD integrator (\(NpT\)), dark gray to light gray

The chosen dynamics also has an impact on the mean conformational weights. When the SD integrator is used, the largest conformation, \(P_{II}\), is sampled less dominantly than with the MD integrator (\(NVT\): 36.18 % compared to 39.68 %, and \(NpT\): 39.02 % compared to 44.93 %). Instead, the conformational weight is distributed more equally over the minor conformations \(\alpha _R, \alpha _P\) and \(\alpha _L\).

The results show an acceptable agreement with the reference weights that were extracted from the 200 ns MD trajectory for all runs using 500 ps or more minimum sampling time per node, at least for the runs conducted with the MD integrator (i.e. the same integrator that was used for the long-time trajectories used as reference). Long-time data from the SD integrator is not available, but it can be expected to deliver a slightly different distribution. In general, the largest deviation is found for the \(\alpha _R\) conformation: ZIBgridfree tends to overweight \(\alpha _R\) by about 4 %, a weight that is mostly drawn from the \(\alpha _P\), and partly from the \(\alpha _L\) conformation. As the conformations of alanine dipeptide tend to have notable overlapping regions (as opposed to the well-separated conformations of \(n\)-pentane), the error might not only be due to insufficient sampling, but also to imperfect clustering of certain states in transient regions. For comparison, the conformational weights in the \(NVT\) and the \(NpT\) ensemble, obtained from ZIBgridfree with 15 nodes and 1,000 ps minimum sampling time per node and averaged over ten runs, are given in Table 4.

Table 4 Averaged conformational weights of alanine dipeptide at 300 K in the \(NVT\) and in the \(NpT\) ensemble, derived from ten runs of ZIBgridfree using the Amber-99SB force field (MD integrator, double precision GROMACS)

4.3 Host-guest binding process in explicit solvent

In order to give a proof of concept for a different application of the algorithm, the analysis of a small crown ether-ammonium host-guest binding process is presented in the following. The system consists of an 18-crown-6 dimer host molecule (C6), and an ammonium ion guest molecule incorporating a short flexible tail (MonoG1+H). The thermodynamics of the formation of complex (MonoG1+H)\(\bullet \) C6 in a mixture of chloroform and methanol and in the presence of tosylate counter ions (denoted as OTs) was characterized recently, along with an analogous bivalent system [42].

The presampling for this system was obtained by free diffusion MD simulations involving the complete explicit solvent and counter ion setup. One out of five 10 ns MD simulations starting from the unbound state (Fig. 11, left) with about 1.5 nm separation between host and guest molecule captured a binding event. The relatively low yield can be explained by the fact that (i) both host and guest molecule are rather small and mobile and therefore subject to rapid diffusion in the box and (ii) the complexation of host and guest is hindered by the counter ion associating with the ammonium moiety, obscuring the interaction site. Consequently, not every close contact between host and guest immediately induces complex formation. The trajectory which captured the binding event was prolonged to a total of 100 ns without showing indications for complex dissociation.

Fig. 11 Left Host-guest system C6-(MonoG1+H)-OTs after 2 ns equilibration of the solvent mixture (10:1 chloroform-methanol) in the position-restrained unbound state at 298 K (chloroform = gray, methanol = purple). Polar clusters of methanol molecules are clearly visible. Right Three distances between ammonium moiety and binding site form the internal coordinates for the system (Color figure online)

4.3.1 Discretization and metastability analysis

The conformational space discretization was based on a set of internal coordinates consisting of three strongly correlated distances between ammonium moiety and 18-crown-6 ring (Fig. 11, right). In order to remove the abundance of unbound states not related to the binding process from the presampling data, states in which the interaction sites were more than 1.8 nm apart were discarded. A total number of 16 discretization nodes was placed equidistantly in the remaining part of conformational space. For each discretization node, 5 \(\times \) 500 ps of MD in the \(NVT\) ensemble were simulated at a temperature of 298 K, with each 500 ps run starting at the initial position of the discretization node using a random starting impulse vector, leading to a joint sampling time of 40 ns for the complete discretization.
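The distance-based pruning of the presampling and the resulting sampling budget can be sketched as follows. How the 1.8 nm criterion was applied to the three distances is not stated. The sketch assumes that a frame is discarded as soon as any of the three distances exceeds the cut-off, and the function names and toy distances are hypothetical.

```python
import numpy as np

def filter_presampling(distances_nm, cutoff_nm=1.8):
    """Keep only frames in which all host-guest distances are within the cut-off.

    distances_nm: array of shape (n_frames, 3), the three internal coordinates
    Returns the retained rows and the boolean mask that selected them.
    """
    d = np.asarray(distances_nm, dtype=float)
    # Assumption: one distance above the cut-off is enough to discard a frame
    mask = (d <= cutoff_nm).all(axis=1)
    return d[mask], mask

def joint_sampling_time_ns(n_nodes, runs_per_node, run_length_ps):
    """Total sampling time over all nodes, converted from ps to ns."""
    return n_nodes * runs_per_node * run_length_ps / 1000.0

# Hypothetical presampling frames (distances in nm)
toy_distances = np.array([[0.5, 0.6, 0.7],
                          [1.0, 1.9, 1.2],
                          [1.8, 1.7, 1.6],
                          [2.5, 2.6, 2.7]])
kept, mask = filter_presampling(toy_distances)

# 16 nodes x 5 runs x 500 ps, as stated in the text
total_ns = joint_sampling_time_ns(16, 5, 500)
```

With the numbers from the text, `total_ns` evaluates to 40 ns, matching the joint sampling time quoted above.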

The thermodynamic reweighting of the partial distributions sampled for the 16 discretization nodes documents a decrease in potential energy that is directly related to the distance of the host to the guest molecule (Fig. 12, left). A notable improvement in the interaction energy sets in with node 11 at an approximate host-guest distance of 7.5 Å, and culminates in the bound state (nodes 14, 15 and 16). While nodes 1–10 have similar (and low) thermodynamic weights, nodes 11–16, covering host-guest distances of 7.5 Å and nearer, represent the largest share of the distribution.

Fig. 12 Left Mean potential energy (gray) and corrected discretization node weights (blue) for the 16 discretization nodes of system C6-(MonoG1+H)-OTs. Right Overlap integral matrix \(S\) with 16 discretization nodes. Large matrix entries (red, yellow) indicate no or only minor overlap with neighboring discretization nodes and represent isolated and/or stable regions. Discretization nodes with many off-diagonal entries (blueish) exhibit a significant overlap with their neighborhood and thus mark transient regions (Color figure online)

The \(S\) matrix of the discretization (Fig. 12, right) exhibits an isolated unbound state represented by node 1, a distinct “block” for the bound state (nodes 14, 15 and 16 in the lower right corner) and a large transition region in between. Accordingly, the clustering with PCCA+ identifies three metastable states, namely the unbound state (UB) with a weight of 4.58 %, the almost bound state (AB) with a weight of 9.04 %, and the bound state (SB) with a weight of 86.38 % (Fig. 13). State UB is detached from the rest of the system except for a small degree of communication involving nodes 2 and 3 that leads into state AB. State AB, in turn, exhibits a smooth transition into state SB. Nodes 6, 9, and in particular 10 mark the transition region between the two clusters AB and SB. Nodes 14, 15 and 16 have the highest membership w.r.t. state SB, and represent the proper bound state.
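In a PCCA+-type clustering, every discretization node carries a soft membership to each metastable state, and the weight of a metastable state is commonly obtained by summing the node weights multiplied by these memberships, i.e. \(\chi ^{\top } w\). The sketch below illustrates this aggregation under that assumption. The 4-node, 2-state example is invented, and only the three reported percentages are taken from the text.

```python
import numpy as np

def metastable_weights(chi, node_weights):
    """Aggregate node weights into metastable state weights via chi^T w.

    chi:          array (n_nodes, n_states), each row sums to 1 (soft memberships)
    node_weights: array (n_nodes,), thermodynamic weights of the nodes
    """
    chi = np.asarray(chi, dtype=float)
    w = np.asarray(node_weights, dtype=float)
    if not np.allclose(chi.sum(axis=1), 1.0):
        raise ValueError("each row of chi must sum to 1")
    return chi.T @ w

# Hypothetical toy system: 4 nodes, 2 metastable states (not from the paper)
chi_toy = [[1.0, 0.0],
           [0.8, 0.2],
           [0.1, 0.9],
           [0.0, 1.0]]
w_toy = [0.1, 0.2, 0.3, 0.4]
state_weights = metastable_weights(chi_toy, w_toy)

# Consistency check on the weights reported in the text (in percent)
reported = {"UB": 4.58, "AB": 9.04, "SB": 86.38}
reported_total = sum(reported.values())
```

Because the memberships of each node sum to one, the state weights inherit the normalization of the node weights. The three reported weights accordingly add up to 100 %.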

Fig. 13 The \(\chi ^{\top }\) matrix (top) of system C6-(MonoG1+H)-OTs groups the 16 discretization nodes into three metastable states: The unbound state (UB, left), the almost bound state (AB, center), and the singly bound state (SB, right). The colors in the \(\chi ^{\top }\) matrix indicate the degree of membership of a discretization node to a given metastable state: dark red = highest degree of membership, dark blue = no membership. Nodes 6, 9 and 10 represent transition regions that belong almost evenly to the two metastable states AB and SB (Color figure online)

In order to look into the transition behavior on the level of the metastable states, additional unrestrained short-time MD simulations in the \(NVE\) ensemble were conducted. The unrestrained “transition nodes” (as opposed to the discretization nodes used for sampling the stationary distribution) were placed in regions of conformational space that mark interfaces between the different metastable states, and thus are prone to reveal the associated transition behavior more readily than simulations that are started exactly within the center of a metastable region.

(16)

Using a total of 45 transition nodes started for ten runs of 100 ps each using a random starting impulse (45 ns additional sampling time), the transition probability matrix \(P_c(\tau )\) is obtained (Matrix 16). Within the short time span of 100 ps, the system has a very high probability to remain in either state UB or state SB. Given the system is in state AB, it is more likely to make the transition into the bound state (\(\approx \)19 %) than into the unbound state (\(\approx \)5 %).
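As a rough illustration of how a transition probability matrix over metastable states arises from many short trajectories, the sketch below counts start-state/end-state pairs and normalizes each row. It deliberately uses a crisp assignment of each trajectory end point to a single state, which is a simplification and not necessarily the estimator used for \(P_c(\tau )\) in this work. The start and end labels are invented.

```python
import numpy as np

def count_transition_matrix(start_states, end_states, n_states):
    """Row-stochastic transition matrix from (start, end) state pairs."""
    start = np.asarray(start_states, dtype=int)
    end = np.asarray(end_states, dtype=int)
    counts = np.zeros((n_states, n_states))
    # Accumulate one count per short trajectory, repeated pairs included
    np.add.at(counts, (start, end), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every state needs at least one starting trajectory")
    return counts / row_sums

# Hypothetical labels for ten short runs (0 = UB, 1 = AB, 2 = SB)
toy_start = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
toy_end = [0, 0, 0, 1, 0, 1, 1, 2, 2, 2]
P_toy = count_transition_matrix(toy_start, toy_end, n_states=3)

# Additional sampling time: 45 nodes x 10 runs x 100 ps, as stated in the text
extra_ns = 45 * 10 * 100 / 1000.0
```

Each row of the resulting matrix sums to one, so the off-diagonal entries of a row can be read directly as the probabilities of leaving the corresponding state within the lag time.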

5 Conclusion

As far as the limited number of test cases allows, it was shown that algorithm and software perform reasonably well in determining the conformational weights and inter-conformational transition probabilities of small molecular systems in both vacuum and explicit solvent. The performance of the method in comparison to other approaches was not evaluated explicitly, but, given that a similar algorithmic framework is used, should be in the order of available umbrella sampling approaches. Due to the fact that ZIBgridfree is dependent on the availability of a presampling of conformational space from which discretization nodes can be selected, the cost of obtaining the presampling would have to be added to the overall performance balance. The cost of generating an adequate presampling is dependent on the system in question. For instance, a series of docking poses of a small molecule in a protein binding pocket would also serve as a valid starting point for using ZIBgridfree.

Given the efficiency of current MD code in generating even very long trajectories, the need for a relatively complex algorithm like ZIBgridfree can be questioned. We see the advantage of using ZIBgridfree mainly in the more directed generation of sampling data in transient regions of conformational space (reducing the amount of redundant sampling data) and the possibility to add another level of parallelization to the sampling process, namely parallel sampling of the discretization nodes (i.e. conformational space regions), which can be used to complement the parallel force field evaluation in order to increase the overall sampling efficiency. Furthermore, the use of collective variables (i.e. internal coordinates) and the integrated clustering approach lead to a level of abstraction that significantly facilitates the analysis of the sampling data, the identification of relevant events and their biological or chemical interpretation.

In upcoming work, we would like to improve the usability of software and algorithm. In particular, we would like to eliminate certain discretization parameters that currently have to be set by the user. Ideally, for a given system, an optimal number of discretization nodes is proposed beforehand. The ZIBgridfree scheme is also a suitable discretization of the infinitesimal generator described in [43]. Further investigation in this direction will also be carried out in the future.

6 Supporting information available

The source code of ZIBgridfree is available at https://github.com/CMD-at-ZIB/ZIBMolPy.