
As only a finite quantity of data can be collected for the construction of Markov state models, the parameters characterizing the model and any properties computed from it will always be statistically uncertain. This chapter is concerned with the quantification of this statistical uncertainty, and with its use in validating model quality and predicting properties with the model. The following sections draw on Refs. [2, 7, 11], which should be consulted for further detail.

5.1 Uncertainties in Transition Matrix Elements

We first consider the uncertainty in the transition matrix T(τ) itself when estimated from a finite quantity of data. In some cases the uncertainty in individual elements \(T_{ij}(\tau)\) is of interest, in which case standard errors or confidence intervals of these estimates may be sufficient tools to quantify the uncertainty.

For a transition matrix estimated without the detailed balance constraint, the expectation and variance of individual elements follow from well-known properties of the distribution of stochastic matrices [1]. These uncertainties do, however, depend on the choice of prior used in modeling the full posterior for the transition matrix (Sect. 4.4). Under a uniform prior, the expectation and variance of an individual element \(T_{ij}\) are given by,

$$\begin{aligned} \mathbb{E}[T_{ij}] =& \frac{c_{ij}+1}{c_{i}+n} \equiv\bar{T}_{ij}, \end{aligned}$$
(5.1)
$$\begin{aligned} \operatorname{Var}[T_{ij}] =& \frac {(c_{ij}+1)((c_{i}+n)-(c_{ij}+1))}{(c_{i}+n)^{2}((c_{i}+n)+1)} \\ =&\frac {\bar{T}_{ij}(1-\bar{T}_{ij})}{c_{i}+n+1} , \end{aligned}$$
(5.2)

where \(c_{ij}\) and \(c_{i}\) are the elements and row sums, respectively, of the observed count matrix \(\mathbf{C}^{\mathrm{obs}}\) (Sect. 4.2).
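As a concrete illustration, Eqs. (5.1)–(5.2) can be evaluated directly from an observed count matrix. The following is a minimal sketch (the count matrix and function name are hypothetical, chosen only for demonstration):

```python
import numpy as np

def posterior_mean_var(C_obs):
    """Posterior mean (5.1) and variance (5.2) of each T_ij under a uniform prior."""
    C = np.asarray(C_obs, dtype=float)
    n = C.shape[0]
    c_i = C.sum(axis=1, keepdims=True)              # row sums c_i
    T_bar = (C + 1.0) / (c_i + n)                   # Eq. (5.1)
    var = T_bar * (1.0 - T_bar) / (c_i + n + 1.0)   # Eq. (5.2)
    return T_bar, var

C_obs = np.array([[90, 10],
                  [20, 80]])
T_bar, var = posterior_mean_var(C_obs)
# each row of T_bar sums to one by construction
```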

To see the effect that the choice of prior has on the computed uncertainties, consider a trajectory of a given molecular system which is analyzed with two different state space discretizations. Assume one discretization uses n=10 states, and the other n=1000. Assume that a lag time τ has been chosen which is identical for both discretizations and long enough to provide Markov models with small discretization error for both n (as suggested in Sect. 4.7). With a uniform prior (\(c_{ij}=c_{ij}^{\mathrm{obs}}\)), the posterior expectation \(\bar{T}_{ij}\) would differ between the two discretizations: while in the n=10 case we can obtain a distinct transition matrix estimate, in the n=1000 case most \(c_{ij}\) are probably zero and \(c_{i}\ll n\), such that the expectation value would be biased towards the uninformative \(T_{ij}\approx 1/n\pm 1/n\) matrix, and many observed transitions would be needed to overcome this bias. This behavior is undesirable. Thus, for uncertainty estimation it is suggested to use a prior which allows the observation data to have more impact, also in the low-data regime.

On the other hand, the “null prior” [10] defined by

$$ c_{ij}^{\mathrm{prior}}\rightarrow-1\quad\forall i,j\in\{1,\ldots,n\}, $$
(5.3)

represents the other extreme. Under the null prior, the expectation and variance of the marginalized posterior for a single \(T_{ij}\) become,

$$\begin{aligned} \bar{T}_{ij} = & \mathbb{E}[T_{ij}]=\frac{c_{ij}^{\mathrm{obs}}}{c_{i}^{\mathrm{obs}}}= \hat{T}_{ij}, \end{aligned}$$
(5.4)
$$\begin{aligned} \operatorname{Var}(T_{ij}) = & \frac{c_{ij}^{\mathrm{obs}}(c_{i}^{\mathrm{obs}}-c_{ij}^{\mathrm{obs}})}{(c_{i}^{\mathrm{obs}})^{2}(c_{i}^{\mathrm{obs}}+1)} \\ =&\frac{\hat{T}_{ij}(1-\hat{T}_{ij})}{c_{i}^{\mathrm{obs}}+1}. \end{aligned}$$
(5.5)

Thus, with the null prior, the expectation value coincides with the likelihood maximum. Both expectation value and variance are independent of the number of discretization bins used. The variance of any \(T_{ij}\) decays asymptotically with the number of transitions out of state i, as expected for sampling expectations from the central limit theorem.
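The prior-induced bias discussed above can be made concrete with a small numerical sketch (the states and counts are hypothetical): for a sparsely sampled row in a fine discretization, the uniform-prior mean (5.1) is pulled toward 1/n, while the null-prior mean (5.4) remains at the maximum-likelihood estimate.

```python
import numpy as np

n = 1000                 # fine discretization
c_obs = np.zeros(n)
c_obs[1] = 50.0          # all 50 observed transitions out of state i go to state 1
c_i = c_obs.sum()

mean_uniform = (c_obs[1] + 1.0) / (c_i + n)   # Eq. (5.1): biased toward 1/n
mean_null = c_obs[1] / c_i                    # Eq. (5.4): maximum likelihood
```

Here `mean_null` equals 1.0 while `mean_uniform` is only about 0.049, even though every observed transition went to state 1.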

5.2 Uncertainties in Computed Properties

In practice, one is often not primarily interested in the uncertainties of the transition matrix elements themselves, but rather in the uncertainties in properties computed from the transition matrix. Here, we review two different approaches for this purpose.

  • Linear error perturbation [4, 12, 13]. Here, the transition matrix posterior distribution is approximated by a multivariate Gaussian, and the property of interest—taken to be a function of the transition matrix or its eigenvalues and eigenvectors—is approximated by a first-order Taylor expansion about the center of this Gaussian. This results in a Gaussian distribution of the property of interest, with a mean and a covariance matrix that can be computed in terms of the count matrix C. This approach has the advantage that error estimates, and their rates of reduction for different sampling strategies, can be computed through a direct procedure. As a result, it is convenient for situations where uncertainty estimates are used as part of an adaptive sampling procedure [4, 8, 9, 13]. The disadvantage of this approach is that the Gaussian approximation of the transition matrix posterior is only asymptotically correct, and can easily break down when few counts have been observed. In the low-data regime, the resulting Gaussian distribution for the property of interest often gives substantial probability to unphysical or meaningless values, for example by allowing transition matrix elements \(T_{ij}\) to assume values outside the range [0,1]. Moreover, the property of interest is approximated linearly, which can introduce a significant error when this property is nonlinear.

  • Markov chain Monte Carlo (MCMC) sampling of transition matrices [2, 6, 7]. Here, transition matrices are sampled from the posterior distribution, and the property of interest is computed for each of these and stored as samples from the posterior distribution of the property. This approach requires that the sampling procedure be run sufficiently long that good estimates of standard deviations or confidence intervals of the posterior distribution of the property of interest can be computed, which may be time-consuming. The advantage of this approach is that no assumptions are made concerning the functional form of the distribution or the property being computed. Furthermore, this approach can be straightforwardly applied to any function or property of transition matrices, including complex properties such as transition path distributions [10] without deriving the expressions necessary for the linear error perturbation analysis—often a cumbersome task. However, for large state spaces, the transition matrix T may grow so large as to make this procedure impractical.

5.3 Linear Error Propagation

We start again with the posterior distribution of row-stochastic transition matrices without the detailed balance constraint, given by Eq. (4.10). Defining a new matrix U,

$$ \mathbf{U}=[u_{ij}]=[c_{ij}+1], $$
(5.6)

and using the fact that the posterior probability \(p(\mathbf{T}\mid\mathbf{C}^{\mathrm{obs}})\) implicitly contains the prior probabilities, Eq. (4.10) can be rewritten as:

$$ p(\mathbf{T}\mid{\mathbf{C}}) = p\bigl(\mathbf{T}\bigm|\mathbf{C}^{\mathrm{obs}} \bigr)\propto\prod_{i}\prod _{j}T_{ij}^{u_{ij}-1} $$
(5.7)

such that

$$ \mathbf{T}_{i*} \sim\mathrm{Dir} (\mathbf{u}_{i*} ) $$
(5.8)

where Dir(α) denotes the Dirichlet distribution, and θ∼Dir(α) implies that θ is drawn from the distribution

$$ p(\boldsymbol{\theta}) \propto\prod_i \theta_i^{\alpha_i - 1}. $$
(5.9)

Based on well-established properties of this distribution, and using the abbreviation \(u_{i}=\sum_{j}u_{ij}\), the moments of \(p(\mathbf{T}\mid\mathbf{C})\) can be directly computed,

$$\begin{aligned} \bigl[\mathbb{E}(\mathbf{T})\bigr]_{ij} = & \frac{u_{ij}}{u_{i}}= \frac {c_{ij}+1}{c_{i}+n} = \bar{T}_{ij}, \\ \bigl(\arg\max p(\mathbf{T}|\mathbf{C})\bigr)_{ij} = & \frac {u_{ij}-1}{u_{i}-n}=\frac{c_{ij}}{c_{i}} = \hat{T}_{ij}, \\ \operatorname{Var}(T_{ij}) = & \frac {u_{ij}(u_{i}-u_{ij})}{u_{i}^{2}(u_{i}+1)}\\ =& \frac{\bar{T}_{ij}(1-\bar {T}_{ij})}{u_{i}+1}\\ =&\frac{\bar{T}_{ij}(1-\bar{T}_{ij})}{c_{i}+n+1}, \\ \operatorname{Cov}(T_{ij},T_{ik}) = & \frac {-u_{ij}u_{ik}}{u_{i}^{2}(u_{i}+1)} \quad \forall j\neq k. \end{aligned}$$

Next, we determine how the uncertainties given by the variances and covariances of the transition matrix elements propagate onto uncertainties of functions derived from transition matrices, such as eigenvalues. If we do not have constraints between different rows, such as are imposed by detailed balance, the rows can be treated as independent random vectors, and thus,

$$ \operatorname{Cov} (T_{ij},T_{lk} )=0,\quad i\neq l . $$
(5.10)

We can thus define a covariance matrix Σ (i) separately for each row i as,

$$\begin{aligned} \varSigma_{jk}^{(i)} :=&\operatorname{Cov} (T_{ij},T_{ik} )\\ = & \frac {1}{u_{i}^{2}(u_{i}+1)} [u_{i}\delta_{jk}u_{ij}-u_{ij}u_{ik} ] \\ \approx & \frac{1}{c_i} [\delta_{jk}\bar{T}_{ij}- \bar{T}_{ij}\bar {T}_{ik} ] , \end{aligned}$$

where δ is the Kronecker delta. Alternatively, we can write the covariance matrix Σ (i) in vector notation,

$$\begin{aligned} \boldsymbol{\varSigma}^{(i)} = & \frac{1}{u_{i}^{2}(u_{i}+1)} \bigl[u_{i}\operatorname{diag} (\mathbf{u}_{i*} )- \mathbf{u}_{i*}(\mathbf {u}_{i*})^{T} \bigr] \\ \approx & \frac{1}{c_i} \bigl[\operatorname{diag} (\bar{\mathbf{T}}_{i*} )-\bar {\mathbf{T}}_{i*}(\bar{\mathbf{T}}_{i*})^{T} \bigr] . \end{aligned}$$

In the limit of many observed transition counts, the covariance of each Dirichlet-distributed row scales approximately with the inverse of the total number of counts in that row, \(c_{i}\).
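The row covariance matrix \(\boldsymbol{\varSigma}^{(i)}\) above is straightforward to construct numerically; a minimal sketch with hypothetical counts:

```python
import numpy as np

def row_covariance(u_row):
    """Exact Dirichlet covariance matrix of one transition matrix row,
    with u_row the Dirichlet parameters u_i* = counts + 1."""
    u = np.asarray(u_row, dtype=float)
    u_i = u.sum()
    return (u_i * np.diag(u) - np.outer(u, u)) / (u_i**2 * (u_i + 1.0))

Sigma = row_covariance([5.0, 3.0, 2.0])   # u_i = 10
# each row of Sigma sums to zero, because the T_ij within a row sum to one
```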

With a sufficient number of counts \(c_{i}\) in each row i, the Dirichlet distribution resembles a multivariate Gaussian distribution, and we can approximate it as such using the mean and covariance computed above,

$$ \mathbf{T}_{i*}\sim\mathrm{Normal} \bigl(\hat{\mathbf {T}}_{i*},\boldsymbol{\varSigma}^{(i)} \bigr). $$
(5.11)

This approximate distribution is used in a Gaussian error propagation for linear functions of the transition matrix. Let us assume that we are interested in computing the statistical error of a scalar function \(f(\mathbf{T}) : \mathbb{R}^{n\times n}\rightarrow\mathbb{R}\). The first-order Taylor approximation is given by:

$$f(\mathbf{T})\approx f(\hat{\mathbf{T}})+\sum_{i,j} \frac{\partial f}{\partial T_{ij}}\bigg\vert _{\hat{\mathbf{T}}}(T_{ij}-\hat{T}_{ij}). $$

Since the uncertainties in the rows of T contribute independently to the uncertainty in f, we define a sensitivity vector \(\mathbf{s}^{(i)}\) for each row separately,

$$s_{j}^{(i)}=\frac{\partial f}{\partial T_{ij}}(\hat{\mathbf{T}}) $$

that measures the sensitivity of the scalar function with respect to changes in the transition matrix elements. The point estimate of f is then

$$\hat{f}=f (\hat{\mathbf{T}} ) $$

and Gaussian error propagation yields an approximation for the variance of f,

$$\operatorname{Var} (f )=\operatorname{Cov}(f,f)=\sum _{i} \bigl(\mathbf {s}^{(i)} \bigr)^{T} \boldsymbol{\varSigma}^{(i)}\mathbf{s}^{(i)}. $$

or, more generally, for the covariance between two different scalar functions f and g,

$$\operatorname{Cov}(f,g) = \sum_{i} \bigl( \mathbf{s}[f]^{(i)} \bigr)^{T}\boldsymbol{\varSigma}^{(i)} \mathbf{s}[g]^{(i)}. $$

where \(\mathbf{s}[f]^{(i)}\) and \(\mathbf{s}[g]^{(i)}\) refer to the sensitivities of f and g, respectively. The limitation of this approach is that it does not work well in situations where the transition matrix distribution is far from Gaussian (especially when little data is available). Furthermore, the more nonlinear the function of interest is in terms of the \(T_{ij}\), the larger the error in its estimated uncertainty may become.
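For a linear function the propagation is exact, which gives a simple consistency check. The sketch below (hypothetical 2×2 counts; the scalar function is the trace of T, whose sensitivity vectors are unit vectors) compares the propagated variance against direct sampling from the Dirichlet posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

U = np.array([[80.0, 20.0],
              [10.0, 90.0]])   # u_ij = c_ij + 1 as in Eq. (5.6)

def row_cov(u):
    u_i = u.sum()
    return (u_i * np.diag(u) - np.outer(u, u)) / (u_i**2 * (u_i + 1.0))

# f(T) = T_00 + T_11, so s^(0) = (1, 0) and s^(1) = (0, 1).
S = np.eye(2)
var_lin = sum(S[i] @ row_cov(U[i]) @ S[i] for i in range(2))

# Monte Carlo reference: sample each row from its Dirichlet distribution.
f_samples = rng.dirichlet(U[0], 200000)[:, 0] + rng.dirichlet(U[1], 200000)[:, 1]
var_mc = f_samples.var()
```

Since f is linear here, `var_lin` and `var_mc` agree up to sampling noise; for a nonlinear f, the propagated value is only a first-order approximation.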

5.3.1 Example: Eigenvalues

As an example, we consider the computation of the statistical error in a particular eigenvalue \(\lambda_{k}\) of the transition matrix T using the linear error propagation scheme, closely following the approach described in Refs. [4, 13].

We start from the eigenvalue decomposition of the transition matrix T, omitting the dependence on the lag time τ,

$$ \boldsymbol{\varLambda}=\boldsymbol{\varPhi}\mathbf{T}\boldsymbol{\varPsi} $$
(5.12)

where \(\boldsymbol{\varPsi}=[\boldsymbol{\psi}_{1},\ldots,\boldsymbol{\psi}_{n}]\) is the right eigenvector matrix, \(\boldsymbol{\varPhi}=[\boldsymbol{\phi}_{1},\ldots,\boldsymbol{\phi}_{n}]^{T}=\boldsymbol{\varPsi}^{-1}\) is the left eigenvector matrix, and \(\boldsymbol{\varLambda}=\operatorname{diag}(\lambda_{i})\) is the diagonal matrix of eigenvalues. For the kth eigenvalue-eigenvector pair, we have,

$$\begin{aligned} \lambda^{(k)} = & \bigl(\boldsymbol{\phi}^{(k)} \bigr)^{T}\mathbf{T}\boldsymbol {\psi}^{(k)} = \sum _{i,j}\phi_{i}^{(k)}T_{ij} \psi_{j}^{(k)}. \end{aligned}$$

We wish to compute the statistical error of the eigenvalues λ (k) via linear error perturbation. In general, both the eigenvalues and eigenvectors simultaneously depend on perturbations in the elements of T in a complex way. To first order, the partial derivatives of the eigenvalues with respect to the transition matrix elements are given by products of the left and right eigenvector components,

$$ \frac{\partial\lambda^{(k)}}{\partial T_{ij}}=\phi_{i}^{(k)}\psi_{j}^{(k)}. $$
(5.13)

This expression for the eigenvalue sensitivity may be combined with Eq. (5.11) in order to yield the linear perturbation result,

$$\begin{aligned} \operatorname{Var} \bigl(\lambda^{(k)} \bigr) = & \sum _{i=1}^{n}\sum_{a,b} \frac{\partial\lambda^{(k)}}{\partial T_{ia}}\operatorname{Cov}(T_{ia},T_{ib})\frac{\partial\lambda^{(k)}}{\partial T_{ib}} \\ = & \sum_{i=1}^{n} \bigl(\phi_{i}^{(k)} \bigr)^{2} \biggl(\sum_{a} \bigl(\psi_{a}^{(k)} \bigr)^{2}\frac{u_{ia}(u_{i}-u_{ia})}{u_{i}^{2}(u_{i}+1)}\\ &{}-\sum_{a\neq b}\psi_{a}^{(k)}\psi_{b}^{(k)}\frac{u_{ia}u_{ib}}{u_{i}^{2}(u_{i}+1)} \biggr). \end{aligned}$$
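As a numerical sketch (with hypothetical counts), the eigenvalue sensitivities of Eq. (5.13) can be combined with the Dirichlet row covariances to evaluate this variance; with \(\boldsymbol{\varPhi}=\boldsymbol{\varPsi}^{-1}\), the normalization \(\boldsymbol{\phi}^{(k)}\cdot\boldsymbol{\psi}^{(k)}=1\) holds automatically:

```python
import numpy as np

U = np.array([[80.0, 20.0,  2.0],
              [10.0, 90.0,  2.0],
              [ 3.0,  3.0, 96.0]])
T_hat = U / U.sum(axis=1, keepdims=True)

evals, R = np.linalg.eig(T_hat)       # columns of R are right eigenvectors
L = np.linalg.inv(R)                  # rows of L are left eigenvectors (L @ R = I)
k = np.argsort(evals.real)[-2]        # slowest nontrivial (second-largest) eigenvalue
phi, psi = L[k].real, R[:, k].real

def row_cov(u):
    u_i = u.sum()
    return (u_i * np.diag(u) - np.outer(u, u)) / (u_i**2 * (u_i + 1.0))

# Var(lambda_k): sensitivities phi_i psi_j (Eq. (5.13)) with row covariances
var_lam = sum(phi[i]**2 * (psi @ row_cov(U[i]) @ psi) for i in range(3))
```

A finite-difference perturbation of \(T_{ij}\) reproduces the sensitivity \(\phi_{i}^{(k)}\psi_{j}^{(k)}\) to first order, which provides a quick check of the formula.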

5.4 Sampling Transition Matrices Without Detailed Balance Constraint

In a full Bayesian approach, we sample the posterior distribution,

$$ p(\mathbf{T}\mid\mathbf{C})\propto p(\mathbf{T})p(\mathbf{C}\mid\mathbf {T}) = \prod_{i,j}T_{ij}^{c_{ij}} $$
(5.14)

where we recall that the total count matrix \(\mathbf{C}=\mathbf{C}^{\mathrm{obs}}+\mathbf{C}^{\mathrm{prior}}\), as discussed in Chap. 4, makes the use of different priors straightforward. If the only constraint on T is that it is a stochastic matrix, and we do not require that T fulfills detailed balance, we can view Eq. (5.14) as a product of Dirichlet distributions, one for each row (see Eq. (5.7)). We are then faced with the problem of sampling random variables from the distribution,

$$ \mathbf{T}_{i*} \sim\mathrm{Dir} (\mathbf{u}_{i*} ) . $$
(5.15)

A fast way to generate Dirichlet-distributed random variables is to draw n independent samples \(y_{1},\ldots,y_{n}\) from univariate Gamma distributions, each with density,

$$\begin{aligned} &y_{j}\sim\mathrm{Gamma}(c_{ij}+1,1)=\frac {y_{j}^{c_{ij}}e^{-y_{j}}}{\varGamma(c_{ij}+1)}, \\ &\quad j = 1, \ldots, n , \end{aligned}$$
(5.16)

and then obtain the T ij by normalization of each row,

$$ T_{ij}=\frac{y_{j}}{\sum_{k=1}^{n}y_{k}}. $$
(5.17)

Repeating this procedure independently for every row i=1,…,n will generate a statistically independent sample of T from distribution (5.14).
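The Gamma-based sampling of Eqs. (5.16)–(5.17) takes only a few lines; a minimal sketch (hypothetical counts, using numpy's Gamma generator):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_transition_matrices(C_obs, n_samples):
    """Independent posterior samples of T without the detailed balance constraint.

    Each row is Dirichlet(u_i*) with u_ij = c_ij + 1, realized by drawing
    Gamma(c_ij + 1, 1) variates (Eq. (5.16)) and normalizing rows (Eq. (5.17))."""
    C = np.asarray(C_obs, dtype=float)
    Y = rng.gamma(shape=C + 1.0, scale=1.0, size=(n_samples,) + C.shape)
    return Y / Y.sum(axis=2, keepdims=True)

Ts = sample_transition_matrices([[90, 10], [20, 80]], n_samples=10000)
# the sample mean of T_00 approaches the posterior mean (c_00 + 1)/(c_0 + n)
```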

5.5 Sampling the Reversible Transition Matrix Distribution

No similarly simple approach for the direct generation of statistically independent samples from the distribution (5.14) exists when the transition matrix T is further constrained to fulfill detailed balance. To include the detailed balance constraint, we consider sampling Eq. (5.14) using the Metropolis-Hastings algorithm, in which we propose a change to the transition matrix, \(\mathbf{T}\rightarrow\mathbf{T}'\). This proposal is accepted with probability given by the Metropolis-Hastings criterion,

$$\begin{aligned} p_{\mathrm{acc}} =& \frac{p(\mathbf{T}'\rightarrow\mathbf{T})}{p(\mathbf {T}\rightarrow\mathbf{T}')} \frac{p(\mathbf{T}'|\mathbf{C})}{p(\mathbf {T}|\mathbf{C})} \\ =&\frac{p(\mathbf{T}'\rightarrow\mathbf{T})}{p(\mathbf {T}\rightarrow\mathbf{T}')} \frac{p(\mathbf{C}|\mathbf{T}')}{p(\mathbf {C}|\mathbf{T})} \\ =&\frac{p(\mathbf{T}'\rightarrow\mathbf{T})}{p(\mathbf {T}\rightarrow\mathbf{T}')} \frac{\prod_{i,j}{T'}_{ij}^{c_{ij}}}{\prod_{i,j}T_{ij}^{c_{ij}}}. \end{aligned}$$
(5.18)

This scheme requires efficient proposals \(\mathbf{T}\rightarrow\mathbf{T}'\) that maintain the detailed balance constraint and are likely to be accepted, as well as a method of efficiently computing the ratio of proposal probabilities \(p(\mathbf{T}'\rightarrow\mathbf{T})/p(\mathbf{T}\rightarrow\mathbf{T}')\) for each proposal. Such a scheme was worked out in detail in Ref. [7], and we summarize the resulting method as Algorithm 2.

Algorithm 2

Metropolis Monte Carlo sampling of reversible stochastic matrices
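The exact proposal steps of Algorithm 2 are given in Ref. [7]. As a hedged sketch of the same idea (not the algorithm of Ref. [7] itself; the parametrization and step size are illustrative choices), one can parametrize a reversible T by a symmetric matrix X of virtual counts, so that detailed balance holds by construction, and apply a Metropolis criterion with a symmetric random-walk proposal on one off-diagonal element at a time:

```python
import numpy as np

rng = np.random.default_rng(5)

def log_like(X, C):
    """Log-likelihood of Eq. (5.14) with T_ij = x_ij / x_i, x_i = row sum of X."""
    x_i = X.sum(axis=1)
    return np.sum(C * (np.log(X) - np.log(x_i)[:, None]))

def sample_reversible(C, n_sweeps=200, step=0.5):
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    X = C + C.T + 1.0               # symmetric, strictly positive initial guess
    Ts = []
    for _ in range(n_sweeps):
        for i in range(n):
            for j in range(i + 1, n):
                d = rng.uniform(-step, step)      # symmetric proposal
                if X[i, j] + d <= 0.0:
                    continue                      # reject moves leaving the domain
                Xp = X.copy()
                Xp[i, j] += d
                Xp[j, i] += d                     # keep X symmetric -> T reversible
                if np.log(rng.random()) < log_like(Xp, C) - log_like(X, C):
                    X = Xp
        Ts.append(X / X.sum(axis=1, keepdims=True))
    return np.array(Ts)

Ts = sample_reversible([[90.0, 10.0,  0.0],
                        [ 5.0, 80.0, 15.0],
                        [ 0.0, 15.0, 85.0]])
```

Every sampled matrix is row-stochastic and fulfills detailed balance with respect to its own stationary distribution \(\pi_{i}\propto x_{i}\), by construction of the symmetric parametrization.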

Example 1

Every 2×2 transition matrix is reversible. To see this, we can compute the stationary distribution from the dominant eigenvector,

$$ \boldsymbol{\pi}= \biggl(\frac{T_{21}}{T_{12}+T_{21}},\frac {T_{12}}{T_{12}+T_{21}} \biggr) , $$
(5.19)

from which we can see that detailed balance is always fulfilled,

$$ \pi_1 T_{12} = \frac{T_{21}}{T_{12}+T_{21}} T_{12} = \frac {T_{12}}{T_{12}+T_{21}} T_{21} = \pi_2 T_{21} . $$
(5.20)

Indeed, for 2×2 matrices the nonreversible transition matrix sampling scheme (Sect. 5.4) generates the same distribution as the reversible transition matrix sampling scheme in Algorithm 2. See Fig. 5.1B for an illustration of this sampling scheme applied to a 2×2 matrix.
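The claim of Example 1 can be verified numerically in a few lines (the 2×2 stochastic matrix below is arbitrary, chosen only for illustration):

```python
import numpy as np

T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# stationary distribution from Eq. (5.19)
pi = np.array([T[1, 0], T[0, 1]]) / (T[0, 1] + T[1, 0])

stationary = pi @ T            # pi is invariant under T
flux_12 = pi[0] * T[0, 1]      # pi_1 T_12
flux_21 = pi[1] * T[1, 0]      # pi_2 T_21: equal by Eq. (5.20)
```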

Fig. 5.1

Illustration of sampling of transition probability matrices for the observation and a uniform prior. Panels (a), (b), and (c) show the probability distribution on the off-diagonal matrix elements. The color encodes the probability density, with blue=0 and red=1. Each density was scaled such that its maximum is equal to 1. (a) Analytic density of stochastic matrices. (b) Sampled density of stochastic matrices (these matrices automatically fulfill detailed balance). (c) Stationary probability of the first state π 1. When sampling with respect to a fixed stationary probability distribution π , the ensemble is fixed to the line \(T_{21} = T_{12} \pi^{*}_{1}/(1-\pi^{*}_{1})\). (d) Sampled and exact density of T 12 of reversible matrices with fixed stationary distribution π =(0.5,0.5)

Example 2

Figure 5.2 illustrates how the distribution of a 3×3 transition matrix differs between the nonreversible (panels B, E, H) and reversible (panels C, F, I) cases. For the matrix studied here, the distribution of reversible matrices is slightly narrower.

Fig. 5.2

Visualization of the probability density of transition matrices for the count matrix and a uniform prior. Different two-dimensional joint marginal distributions are shown in the rows. The analytic and sampled distributions for stochastic matrices are shown in columns 1 and 2, respectively. Column 3 shows the sampled distribution for stochastic matrices fulfilling detailed balance. Note how the distributions are more sharply peaked when the detailed balance constraint is imposed (column 3) compared to the corresponding transition matrices without this constraint (column 2)

5.5.1 Sampling with Fixed Stationary Distribution

In some cases, the stationary distribution, π, may be known exactly or to within very small statistical error. For example, an efficient equilibrium simulation scheme (such as parallel tempering or metadynamics) or a Monte Carlo method may have generated a very precise estimate of π by simulating a perturbed system or one with unphysical dynamics. It may be useful to incorporate this information about π when inferring the posterior distribution of transition matrices, since it may significantly reduce the uncertainty.

To do this, we first note that Algorithm 2 above employs two types of Monte Carlo proposals for sampling reversible transition matrices: one type of proposal (the reversible element shift) changes π, while the other (the node shift) preserves π. This suggests a straightforward modification of the T-sampling algorithm that ensures π is constrained to a specified value during the sampling procedure.

We first give an algorithm to construct an initial transition matrix \(\mathbf{T}^{(0)}\) with a specified stationary distribution π from a given count matrix C (Algorithm 3), and then use this to initialize a Monte Carlo transition matrix sampling algorithm that preserves the stationary distribution (Algorithm 4).

Algorithm 3

Generation of an initial transition matrix \(\mathbf{T}^{(0)}\) given count matrix C and a specified stationary distribution π

Algorithm 4

Metropolis-Hastings Monte Carlo sampling of reversible stochastic matrices with probability distribution of stationary distributions p(π)

5.6 Full Bayesian Approach with Uncertainty in the Observables

Suppose we are interested in some experimentally measurable function of state A(x). An experiment may be able to measure an expectation 〈A〉 or correlation functions 〈A(0)A(t)〉, and we would like to compute the corresponding properties from the Markov model constructed from a molecular simulation and decide whether they agree with experiment to within statistical uncertainty, or if a prediction from the model is sufficiently precise to be useful. The previous framework for sampling transition matrices can be used in the following manner: (i) Assign the state-averaged value of the observable, \(a_{i}=\int_{S_{i}}d\mathbf{x}\, \mu(\mathbf{x}) A(\mathbf{x})\), to each discrete state. (ii) Generate an ensemble of T-matrices according to the sampling scheme described above. (iii) Calculate the desired expectation or correlation function for each T-matrix using the discrete vector \(\mathbf{a}=[a_{i}]\). This approach involves several approximations that each deserve discussion. Here, we want to generalize the approach by eliminating one important approximation—that the values \(a_{i}\) are known exactly, without statistical error themselves.

In a typical simulation scenario, the average \(a_{i}\) is itself calculated from a statistical sample. When a simulation trajectory \(\mathbf{x}_{t}\) is available, then typically the time average

$$ \hat{a}_{i}=\frac{\sum_{t}\chi_{i}(\mathbf{x}_{t}) A(\mathbf {x}_{t})}{\sum_{t}\chi_{i}(\mathbf{x}_{t})} $$
(5.21)

is employed, where \(\chi_{i}\) is the indicator function of state i. The estimate \(\hat{a}_{i}\) may in fact have significant statistical error because the number of uncorrelated samples of \(\mathbf{x}_{t}\) inside any state i is finite, and possibly rather small. In order to estimate the distribution of expectation or correlation functions of A due to both the statistical uncertainty of T and the statistical uncertainty of \(\hat{a}_{i}\), we propose a full Bayesian approach using a Gibbs sampling scheme, here illustrated for the expectation \(\mathbb{E}[A]\) (Algorithm 5).

Algorithm 5

Gibbs sampler for the joint estimation of \(p(\mathbb{E}[A])\)

While the transition matrix \(\mathbf{T}^{(k)}\) can be sampled using the framework described in the previous sections, an approach for sampling \(\mathbf{a}^{(k)}\), introduced in Ref. [2], is described subsequently.
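A hedged sketch of such a Gibbs scheme follows (this is not the exact Algorithm 5; the function and variable names are hypothetical, the transition matrix is sampled without the detailed balance constraint for simplicity, and the state means are drawn with the two-step scheme derived in Sect. 5.6.1):

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_expectation_samples(C_obs, obs_per_state, n_iter=2000):
    """Alternately draw T^(k) and a^(k), recording E[A]^(k) = sum_i pi_i a_i."""
    C = np.asarray(C_obs, dtype=float)
    samples = []
    for _ in range(n_iter):
        # (1) draw T^(k): independent Dirichlet rows via Gamma variates
        Y = rng.gamma(C + 1.0, 1.0)
        T = Y / Y.sum(axis=1, keepdims=True)
        # stationary distribution of T^(k) from the dominant left eigenvector
        w, V = np.linalg.eig(T.T)
        pi = np.abs(V[:, np.argmax(w.real)].real)
        pi /= pi.sum()
        # (2) draw a^(k): sigma^2 from a scaled inverse-chi-square, then
        #     mu from its conditional normal, as in Eq. (5.36)
        a = []
        for A_m in obs_per_state:
            N, mu_hat = len(A_m), np.mean(A_m)
            s2 = np.var(A_m, ddof=1)
            sigma2 = (N - 1) * s2 / rng.chisquare(N - 1)
            a.append(rng.normal(mu_hat, np.sqrt(sigma2 / N)))
        samples.append(pi @ np.array(a))
    return np.array(samples)

# toy example: symmetric counts, sharply peaked observables in each state
samples = gibbs_expectation_samples(
    [[90, 10], [10, 90]],
    [np.full(30, 1.0), np.full(30, 3.0)])
```

Each recorded sample reflects the combined uncertainty from the transition matrix and the state means; summarizing `samples` by its standard deviation or percentiles yields error bars on \(\mathbb{E}[A]\).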

5.6.1 Sampling State Expectations \(\mathbf{a}^{(k)}\)

Consider the expectation of some molecular observable A(x) computed from Eq. (5.21). Temporally sequential samples \(A_{t}\equiv A(\mathbf{x}_{t})\), collected with a temporal resolution of the Markov time τ, are presumed to be uncorrelated. We also assume that the samples \(A(\mathbf{x}_{t})\) for those configurations \(\mathbf{x}_{t}\) appearing in state i are collected in the set \(\{A_{m}\}_{m=1}^{N}\) for the remainder of this section, generally abbreviated as \(\{A_{m}\}\).

Because only a finite number of samples N are collected for each state, there will be a degree of uncertainty in this estimate. Unlike the problem of inferring the transition matrix elements, however, we cannot write an exact expression for the probability of observing a single sample A m in terms of a simple parametric form, since its probability distribution may be arbitrarily complex,

$$ p_{i}(A_{m}) = \frac{1}{\pi_{i}}\int _{S_{i}} d\mathbf{x} \, \delta \bigl(A_{m}-A( \mathbf{x})\bigr) \mu(\mathbf{x}) . $$
(5.22)

Despite this, the central limit theorem states that the distribution of \(\hat{a}_{i}\) approaches a normal distribution (generally very rapidly) as the number of samples N increases. We will therefore make the assumption that \(p_{i}(A_{m})\) is normal—that is, we assume the distribution can be characterized by a mean \(\mu_{i}\) and variance \(\sigma_{i}^{2}\),

$$ A_{m} \sim\mathrm{Normal}\bigl(\mu_i, \sigma_i^2\bigr) $$
(5.23)

where the normal distribution implies the probability density for A m is approximated by

$$\begin{aligned} &\tilde{p}_{i}\bigl(A_{m};\mu_{i}, \sigma_{i}^{2}\bigr) \\ &\quad= (2\pi)^{-1/2}\sigma _{i}^{-1}\exp \biggl[-\frac{1}{2\sigma_{i}^{2}}(A_{m}- \mu_{i})^{2} \biggr] . \end{aligned}$$
(5.24)

While this may seem like a drastic assumption, it turns out this approximation allows us to do a surprisingly good job of inferring the distribution of the error in \(\delta\hat{a}_{i}\equiv\hat{a}_{i}-\langle A\rangle_{i}\) even for a small number of samples from each state, and it generally gives an overestimate of the error (which is arguably less dangerous than an underestimate) for smaller sample sizes. The validity of this approximation is illustrated in a subsequent example; below, we first develop its ramifications.

Consider the sample mean estimator for \(\langle A\rangle_{i}\),

$$\begin{aligned} \hat{\mu} = & \frac{1}{N}\sum_{m=1}^{N}A_{m} . \end{aligned}$$
(5.25)

The asymptotic variance of \(\hat{\mu}\), which provides a good estimate of the statistical uncertainty in \(\hat{\mu}\) in the large-sample limit, is given as a simple consequence of the central limit theorem,

$$\begin{aligned} \delta^{2}\hat{\mu} \equiv&\mathbb{E} \bigl[\bigl(\hat{\mu}-\mathbb{E}[ \hat{\mu }]\bigr)^{2} \bigr] \\ =&\frac{\operatorname{Var}A_{m}}{N}\approx\frac{\hat{\sigma}^{2}}{N} \end{aligned}$$
(5.26)

where the unbiased estimator for the variance \(\sigma^{2}\equiv\operatorname{Var}A_{m}\) is given by

$$\begin{aligned} \hat{\sigma}^{2} \equiv& \frac{1}{N-1}\sum _{m=1}^{N}(A_{m}-\hat{\mu})^{2} \end{aligned}$$
(5.27)

Suppose we now assume the distribution of A from state i is normal (Eq. (5.24)),

$$\begin{aligned} A | \mu,\sigma^{2} \sim& \mathrm{Normal}\bigl(\mu, \sigma^{2}\bigr) . \end{aligned}$$
(5.28)

If this is a reasonable model, we can model the time series of the observable \(A_{t}\equiv A(\mathbf{x}_{t})\) by the hierarchical process:

$$\begin{aligned} \begin{aligned} s_{t} | s_{t-1},\mathbf{T} & \sim \mathrm{Bernoulli}(T_{s_{t-1} 1},\ldots,T_{s_{t-1} N}), \\ A_{t} | \mu_{s_{t}},\sigma_{s_{t}}^{2} & \sim \mathrm{Normal}\bigl(\mu _{s_{t}},\sigma_{s_{t}}^{2} \bigr). \end{aligned} \end{aligned}$$
(5.29)

Here, the notation \(\mathrm{Bernoulli}(\pi_{1},\ldots,\pi_{N})\) denotes a Bernoulli scheme in which discrete outcome n has associated probability \(\pi_{n}\) of being selected. We will demonstrate below how this model does in fact recapitulate the expected behavior in the limit where there are sufficient samples from each state.

We choose the (improper) Jeffreys prior [5],

$$\begin{aligned} p\bigl(\mu,\sigma^{2}\bigr) \propto& \sigma^{-2} \end{aligned}$$
(5.30)

because it satisfies intuitively reasonable reparameterization [5] and information-theoretic [3] invariance principles. Note that this prior is uniform in (μ,logσ).

The posterior is then given by

$$\begin{aligned} &p\bigl(\mu,\sigma^{2}\bigm|\{A_{m}\}\bigr) \\ &\quad\propto \Biggl[ \prod_{m=1}^{N}p\bigl(A_{m}\bigm|\mu, \sigma^{2}\bigr) \Biggr] p\bigl(\mu,\sigma^{2}\bigr) \\ &\quad \propto \sigma^{-(N+2)} \exp \Biggl[-\frac{1}{2\sigma^{2}}\sum _{m=1}^{N}(A_{m}-\mu)^{2} \Biggr] . \end{aligned}$$
(5.31)

Rewriting in terms of the sample statistics \(\hat{\mu}\) and \(\hat{\sigma}^{2}\), we obtain

$$\begin{aligned} &p\bigl(\mu,\sigma^{2}\bigm|\{A_{m}\}\bigr) \\ &\quad \propto \sigma^{-(N+2)} \exp \Biggl\{ -\frac{1}{2\sigma^{2}} \Biggl[ \sum_{m=1}^{N}(A_{m}-\hat{ \mu})^{2} \\ &\qquad{}+N(\hat{\mu}-\mu)^{2} \Biggr] \Biggr\} \\ &\quad \propto \sigma^{-(N+2)} \exp \biggl\{ -\frac{1}{2\sigma^{2}} \bigl[(N-1)\hat{\sigma}^{2} \\ &\qquad{}+N(\hat{\mu}-\mu)^{2} \bigr] \biggr\} . \end{aligned}$$
(5.32)

The posterior has marginal distributions

$$\begin{aligned} \begin{aligned} \sigma^{2} | \{A_{m}\} & \sim \mathrm{Inv-} \chi^{2}\bigl(N-1,\hat{\sigma }^{2}\bigr), \\ \mu | \{A_{m}\} & \sim \mathrm{t}_{N-1}\bigl(\hat{\mu},\hat{ \sigma}^{2}/N\bigr) \end{aligned} \end{aligned}$$
(5.33)

where \(\sigma^{2}\) is distributed according to a scaled inverse chi-square distribution with N−1 degrees of freedom, and μ according to a Student t-distribution with N−1 degrees of freedom that has been shifted to be centered about \(\hat{\mu}\) and scaled in width by \(\hat{\sigma}^{2}/N\).

As can be seen in Fig. 5.3, as the number of degrees of freedom increases, the marginal posterior for μ approaches the normal distribution expected from standard frequentist analysis for the standard error of the mean, namely

$$\begin{aligned} \mu\rightarrow\mathrm{N}\bigl(\hat{\mu},\hat{\sigma}^{2}/N\bigr) . \end{aligned}$$
(5.34)

At low sample counts, the t-distribution is lower and wider than the normal distribution, meaning that confidence intervals computed from it will be somewhat larger than those of the corresponding normal estimate for small samples. In some sense, this partly compensates for \(\hat{\sigma}^{2}\) being a poor estimate of the true variance at small sample sizes, which would otherwise lead to underestimates of the statistical uncertainty. In any case, the small-sample regime is far from the asymptotic limit, where the normal distribution with variance \(\hat{\sigma}^{2}/N\) is expected to model the uncertainty well.

Fig. 5.3

Approach to normality for marginal distribution of the mean p(μ|{A m }). For fixed \(\hat{\mu}\) and \(\hat{\sigma}^{2}\), the marginal posterior distribution of μ (red), a scaled and shifted Student t-distribution, rapidly approaches the normal distribution (black) expected from asymptotic statistics. The PDF is shown for sample sizes of N=5 (the broadest), 10, 20, and 30

The posterior can also be decomposed as

$$\begin{aligned} &p\bigl(\mu,\sigma^{2}\bigm|\{A_{m}\}\bigr) \\ &\quad = p\bigl(\mu \bigm|\sigma^{2},\{A_{m}\}\bigr) p\bigl(\sigma ^{2} \bigm|\{A_{m}\}\bigr). \end{aligned}$$
(5.35)

This readily suggests a two-step sampling scheme for generating uncorrelated samples of \((\mu,\sigma^{2})\), in which we first sample \(\sigma^{2}\) from its marginal distribution, and then μ from its distribution conditional on \(\sigma^{2}\),

$$\begin{aligned} \begin{aligned} \sigma^{2} | \{A_{m}\} & \sim \mathrm{Inv-} \chi^{2}\bigl(N-1,\hat{\sigma }^{2}\bigr), \\ \mu | \sigma^{2}, \{A_{m}\} & \sim \mathrm{N}\bigl(\hat{ \mu},\sigma^{2}/N\bigr). \end{aligned} \end{aligned}$$
(5.36)

Alternatively, if the scaled inverse-chi-square distribution is not available, the \(\chi^{2}\)-distribution (among others) can be used to sample \(\sigma^{2}\):

$$ (N-1) \bigl(\hat{\sigma}^{2}/\sigma^{2}\bigr) \bigm| \{A_{m}\} \sim \chi^{2}(N-1) $$
(5.37)

where, in the notation \(\mathrm{Inv\mbox{-}}\chi^{2}(N-1,\hat{\sigma}^{2})\), the first argument is the number of degrees of freedom and the second argument is the scale parameter.
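In numpy, for example, the two-step scheme can be realized with the ordinary \(\chi^{2}\) generator (a sketch; the sample statistics below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

N, mu_hat, s2_hat = 10, 0.0, 1.0     # sample size and sample statistics

# sigma^2 ~ scaled Inv-chi^2(N-1, s2_hat), realized via Eq. (5.37)
sigma2 = (N - 1) * s2_hat / rng.chisquare(N - 1, size=200000)
# mu | sigma^2 ~ Normal(mu_hat, sigma^2 / N), as in Eq. (5.36)
mu = rng.normal(mu_hat, np.sqrt(sigma2 / N))

# marginally, mu follows the t-distribution of Eq. (5.33), with
# Var(mu) = (s2_hat / N) * (N - 1) / (N - 3)
```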

5.6.2 Illustration of Fully Bayesian Sampling Scheme

Using the sampling procedures described previously, we are now equipped with a scheme to sample from the joint posterior describing our confidence that a Markov model, characterized by a transition matrix T and state expectations \(\mu_{i}\), i=1,…,M, produced the observed trajectory data. Using a set of models sampled from this posterior, we can characterize the statistical component of the uncertainty as it propagates into equilibrium averages, non-equilibrium relaxations, and (non-)equilibrium correlation measurements computed from the Markov model. To ensure the correctness of this procedure, however, we first test its ability to correctly characterize the posterior distribution for a finite-size sample from a true Markovian model system.

How can we test a Bayesian posterior distribution? One of the more powerful features of a Bayesian model is its ability to provide confidence intervals that correctly reflect the level of certainty that the true value lies within them. For example, if the experiment were repeated many times, the true value of the parameter being estimated should fall within the confidence interval for a 95 % confidence level 95 % of the time. As an illustrative example, consider a biased coin where the probability of turning up heads is θ. From an observed sample of N coin flips, we can estimate θ using a Binomial model for the number of coin flips that turn up heads and a conjugate Beta Jeffreys prior [3, 5]. Each time we run an experiment and generate a new independent collection of N samples, we get a different posterior estimate for θ, and a different confidence interval (Fig. 5.4, top). If we run many trials and record what fraction of the time the true (unknown) value of θ falls within the confidence interval estimated from that trial, we can see if our model is correct. If correct, the observed confidence level should match the desired confidence level (Fig. 5.4, bottom right). Deviation from parity means that the posterior is either too broad or too narrow, and that the statistical uncertainty is being either over- or underestimated (Fig. 5.4, bottom left).

Fig. 5.4

Testing the posterior for inference of a biased coin flip experiment. Top: Posterior distribution for inferring the probability of heads, θ, for a biased coin from a sequence of N=1000 coin flips (dark line) with 95 % symmetric confidence interval about the mean (shaded area). The true probability of heads is 0.3 (vertical thick line). Posteriors from five different experiments are shown as dotted lines. Bottom left: Desired and actual confidence levels for an idealized normal posterior distribution that either overestimates (upper left curves) or underestimates (lower right curves) the true posterior variance by different degrees. Bottom right: Desired and actual confidence levels for the Binomial-Beta posterior for the coin flip problem depicted in the upper panel. Error bars show 95 % confidence interval estimates from 1000 independent experimental trials. For inference, we use the likelihood \(N_{H}\mid\theta\sim\mathrm{Binomial}(N,\theta)\) for the observed number of heads and the conjugate Jeffreys prior [3, 5] \(\theta\sim\mathrm{Beta}(1/2,1/2)\), which produces the posterior \(\theta\mid N_{H}\sim\mathrm{Beta}(N_{H}+1/2,N_{T}+1/2)\), with the constraint \(N_{H}+N_{T}=N\)

We performed a similar test on a three-state model system, using a reversible, row-stochastic model transition matrix for one Markov time step given by

$$ \mathbf{T}(1) = \left [ \begin{array}{c@{\quad}c@{\quad}c} 0.86207 & 0.12931 & 0.00862\\ 0.15625 & 0.83333 & 0.01041\\ 0.00199 & 0.00199 & 0.99602 \end{array} \right ]. $$
(5.38)

Each state is characterized by a mean value of the observable A(x), fixed to 3, 2, and 1 for the first, second, and third states, respectively. The equilibrium populations are π≈[0.1625, 0.1345, 0.7031]. Simulation from this model involves a stochastic transition according to the transition elements T ij , followed by observation of a value of A(x) sampled i.i.d. from the current state’s probability distribution p i (A). Multiple independent realizations of this process were carried out and subjected to the Bayesian inference procedure for transition matrices and observables described above. The nonequilibrium relaxation \(\langle A\rangle_{\rho_{0}}\) from the initial condition ρ 0=[1, 0, 0], in which all density is concentrated in state 1, as well as the autocorrelation function 〈A(0)A(t)〉, is shown in Fig. 5.5.
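The simulation protocol just described can be sketched as follows. This is an illustration under stated assumptions (NumPy, an arbitrary seed, and row renormalization to absorb the rounding of the published matrix entries), not the authors' implementation:

```python
import numpy as np

# Transition matrix of Eq. (5.38) and per-state means of the observable A
T = np.array([[0.86207, 0.12931, 0.00862],
              [0.15625, 0.83333, 0.01041],
              [0.00199, 0.00199, 0.99602]])
mu = np.array([3.0, 2.0, 1.0])
rng = np.random.default_rng(0)

# Renormalize rows, since the published entries are rounded to 5 digits
P = T / T.sum(axis=1, keepdims=True)

def simulate(n_steps, s0=0):
    """Sample a state trajectory, then draw A i.i.d. from each visited state."""
    cum = np.cumsum(P, axis=1)
    states = np.empty(n_steps, dtype=int)
    s = s0
    for t in range(n_steps):
        states[t] = s
        # inverse-CDF sampling of the next state (clamped against roundoff)
        s = min(int(np.searchsorted(cum[s], rng.random())), P.shape[0] - 1)
    # normal output distribution with unit variance around the state mean
    A = rng.normal(mu[states], 1.0)
    return states, A

# Exact nonequilibrium relaxation from rho0 = [1, 0, 0]: <A(t)> = rho0 T^t mu
rho0 = np.array([1.0, 0.0, 0.0])
relax = [rho0 @ np.linalg.matrix_power(P, t) @ mu for t in range(51)]
```

The exact curve `relax` starts at the state-1 mean, A=3, and decays toward the equilibrium value ⟨A⟩=π·μ≈1.46, matching the relaxation shown in Fig. 5.5.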

Fig. 5.5

Observables for three-state model system. Top: Relaxation of \(\langle A(t)\rangle_{\rho_{0}}\) (solid line) from initial distribution ρ 0=[1, 0, 0] to the equilibrium expectation 〈A〉 (dash-dotted line). Bottom: Equilibrium autocorrelation function 〈A(0)A(t)〉 (solid line) decaying to \(\langle A\rangle^{2}\) (dash-dotted line). The estimates of both \(\langle A(t)\rangle_{\rho_{0}}\) and 〈A(0)A(t)〉 at 50 timesteps (red vertical line) were assessed in the validation tests described here

With the means of p i (A) within each state fixed as above, we considered models for p i (A) that were either normal or exponential, using the probability density functions:

$$\begin{aligned} &p_{i}(A) = \bigl(2\pi\sigma_{i}^{2}\bigr)^{-1/2} \exp \biggl[-\frac{(A-\mu_{i})^{2}}{2\sigma_{i}^{2}} \biggr] \quad \mathrm{(normal)},\\ &p_{i}(A) = \mu_{i}^{-1}\exp[-A/\mu_{i}] ,\quad A\ge0 \quad \mathrm{(exponential)}. \end{aligned}$$

While the normal output distribution for p i (A) corresponds to the hierarchical Bayesian model that forms the basis for our approach, the exponential distribution is significantly different, and represents a challenging test case.

Figure 5.6 depicts the resulting uncertainty estimates for both normal (top) and exponential (bottom) densities for the observable A. In both cases, the confidence intervals are underestimated for short trajectory lengths (1 000 steps), where, in many realizations, few samples are observed in one or more states, so that the variance is underestimated or the effective asymptotic limit has not yet been reached. As the simulation length is increased to 10 000 or 100 000 steps, making it much more likely that each state collects enough samples to reach the asymptotic limit, the confidence intervals predicted by the Bayesian posterior become quite accurate. For the exponential model for observing values of A (which might arise in, say, fluorescence lifetimes), we observe similar behavior. Except for what appears to be a slight, consistent underestimation of \(\langle A(t)\rangle_{\rho_{0}}\) (much less than half a standard deviation), there is excellent agreement between the expected and observed confidence intervals, confirming that this method is a useful approach to modeling statistical uncertainties in equilibrium and kinetic observables.
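A stripped-down version of this confidence-interval test, restricted to the equilibrium average ⟨A⟩ and using a simple row-wise Dirichlet posterior in place of the full reversible sampler, might look like the following. The trajectory lengths, trial counts, and Jeffreys-like prior pseudocount of 1/2 are illustrative assumptions, not the chapter's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(3)
T = np.array([[0.86207, 0.12931, 0.00862],
              [0.15625, 0.83333, 0.01041],
              [0.00199, 0.00199, 0.99602]])
P = T / T.sum(axis=1, keepdims=True)   # renormalize rounded rows
mu = np.array([3.0, 2.0, 1.0])

def stationary(M):
    """Stationary distribution: left eigenvector for the largest eigenvalue."""
    evals, evecs = np.linalg.eig(M.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    return pi / pi.sum()

A_true = stationary(P) @ mu            # true equilibrium <A>

def one_trial(n_steps=5000, n_post=400):
    # simulate a trajectory and accumulate the transition count matrix
    cum = np.cumsum(P, axis=1)
    C = np.zeros((3, 3))
    s = 0
    for _ in range(n_steps):
        s_next = min(int(np.searchsorted(cum[s], rng.random())), 2)
        C[s, s_next] += 1
        s = s_next
    # sample transition matrices row-wise from Dirichlet(C_ij + 1/2)
    est = np.empty(n_post)
    for k in range(n_post):
        Ts = np.vstack([rng.dirichlet(C[i] + 0.5) for i in range(3)])
        est[k] = stationary(Ts) @ mu
    lo, hi = np.quantile(est, [0.025, 0.975])
    return lo <= A_true <= hi

coverage = np.mean([one_trial() for _ in range(30)])
print(coverage)  # for a calibrated posterior, near the nominal 0.95
```

Repeating this with shorter trajectories should reproduce the undercoverage seen in the left panels of Fig. 5.6, since rarely visited states then dominate the error in the sampled stationary distributions.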

Fig. 5.6

Confidence interval tests for model system. Top: Expected and observed confidence intervals for the three-state system with a normal distribution of unit variance for observable A, for simulations of length 1 000 (left), 10 000 (middle), and 100 000 (right) steps. Confidence intervals were estimated from 10 000 samples generated from the Bayesian posterior. The fraction of times the true value fell within the confidence interval estimated from the Bayesian posterior was computed from 1 000 independent experimental realizations. The resulting curves are shown for the equilibrium estimate 〈A〉 (red), the nonequilibrium relaxation \(\langle A\rangle_{\rho_{0}}\) (green), and the equilibrium correlation function 〈A(0)A(t)〉 (blue). Bottom: Same as top, except that an exponential distribution with the same mean was used for the probability of observing a particular value of A within each state