
As only a finite quantity of data can be collected for the construction of Markov state models, the parameters characterizing the model and any properties computed from it will always be statistically uncertain. This chapter is concerned with the quantification of this statistical uncertainty, and with its use in validating model quality and predicting properties with the model. The following sections draw on Refs. [2, 7, 11], which should be consulted for further detail.

5.1 Uncertainties in Transition Matrix Elements

We first consider the uncertainty in the transition matrix T(τ) itself when estimated from a finite quantity of data. In some cases the uncertainty in individual elements \(T_{ij}(\tau)\) is of interest, in which case standard errors or confidence intervals of these estimates may be sufficient tools to quantify the uncertainty.

For a transition matrix estimated without the detailed balance constraint, the expectation and variance of individual elements follow from well-known properties of the distribution of stochastic matrices [1]. These uncertainties do, however, depend on the choice of prior used in modeling the full posterior for the transition matrix (Sect. 4.4). Under a uniform prior, the expectation and variance of an individual element \(T_{ij}\) are given by,

$$\begin{aligned} \mathbb{E}[T_{ij}] =& \frac{c_{ij}+1}{c_{i}+n} \equiv\bar{T}_{ij}, \end{aligned}$$
(5.1)
$$\begin{aligned} \operatorname{Var}[T_{ij}] =& \frac {(c_{ij}+1)((c_{i}+n)-(c_{ij}+1))}{(c_{i}+n)^{2}((c_{i}+n)+1)} \\ =&\frac {\bar{T}_{ij}(1-\bar{T}_{ij})}{c_{i}+n+1} , \end{aligned}$$
(5.2)

where \(c_{ij}\) and \(c_{i}\) are the elements and row sums, respectively, of the observed count matrix \(\mathbf{C}^{\mathrm{obs}}\) (Sect. 4.2).
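As a concrete illustration, Eqs. (5.1)–(5.2) can be evaluated directly from an observed count matrix. The following is a minimal sketch (the count matrix and function name are hypothetical, chosen only for demonstration):

```python
import numpy as np

def posterior_mean_var(C_obs):
    """Posterior mean (5.1) and variance (5.2) of each T_ij under a uniform prior."""
    C = np.asarray(C_obs, dtype=float)
    n = C.shape[0]
    c_i = C.sum(axis=1, keepdims=True)              # row sums c_i
    T_bar = (C + 1.0) / (c_i + n)                   # Eq. (5.1)
    var = T_bar * (1.0 - T_bar) / (c_i + n + 1.0)   # Eq. (5.2)
    return T_bar, var

C_obs = np.array([[90, 10],
                  [20, 80]])
T_bar, var = posterior_mean_var(C_obs)
# each row of T_bar sums to one by construction
```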

To see the effect that the choice of prior has on the computed uncertainties, consider a trajectory of a given molecular system which is analyzed with two different state space discretizations. Assume one discretization uses n=10 states, and the other n=1000. Assume that a lag time τ has been chosen which is identical for both discretizations and long enough to provide Markov models with small discretization error for both n (as suggested in Sect. 4.7). With a uniform prior (\(c_{ij}=c_{ij}^{\mathrm{obs}}\)), the posterior expectation \(\bar{T}_{ij}\) would differ between the two discretizations: while in the n=10 case we can obtain a distinct transition matrix estimate, in the n=1000 case most \(c_{ij}\) are probably zero and \(c_{i}\ll n\), such that the expectation value would be biased towards the uninformative \(T_{ij}\approx 1/n\pm 1/n\) matrix, and many observed transitions would be needed to overcome this bias. This behavior is undesirable. Thus, for uncertainty estimation it is suggested to use a prior which allows the observation data to have more impact, also in the low-data regime.

On the other hand, the “null prior” [10] defined by

$$ c_{ij}^{\mathrm{prior}}\rightarrow-1\quad\forall i,j\in\{1,\ldots,n\}, $$
(5.3)

represents the other extreme. Under the null prior, the expectation and variance of the marginalized posterior for a single \(T_{ij}\) become,

$$\begin{aligned} \bar{T}_{ij} = & \mathbb{E}[T_{ij}]=\frac{c_{ij}^{\mathrm{obs}}}{c_{i}^{\mathrm{obs}}}= \hat{T}_{ij}, \end{aligned}$$
(5.4)
$$\begin{aligned} \operatorname{Var}(T_{ij}) = & \frac{c_{ij}^{\mathrm{obs}}(c_{i}^{\mathrm{obs}}-c_{ij}^{\mathrm{obs}})}{(c_{i}^{\mathrm{obs}})^{2}(c_{i}^{\mathrm{obs}}+1)} \\ =&\frac{\hat{T}_{ij}(1-\hat{T}_{ij})}{c_{i}^{\mathrm{obs}}+1}. \end{aligned}$$
(5.5)

Thus, with the null prior, the expectation value coincides with the likelihood maximum. Both expectation value and variance are independent of the number of discretization bins used. The variance of any \(T_{ij}\) decays asymptotically with the number of transitions out of state i, as expected for sampling expectations from the central limit theorem.
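The prior-induced bias discussed above can be made concrete with a small numerical sketch (the states and counts are hypothetical): for a sparsely sampled row in a fine discretization, the uniform-prior mean (5.1) is pulled toward 1/n, while the null-prior mean (5.4) remains at the maximum-likelihood estimate.

```python
import numpy as np

n = 1000                 # fine discretization
c_obs = np.zeros(n)
c_obs[1] = 50.0          # all 50 observed transitions out of state i go to state 1
c_i = c_obs.sum()

mean_uniform = (c_obs[1] + 1.0) / (c_i + n)   # Eq. (5.1): biased toward 1/n
mean_null = c_obs[1] / c_i                    # Eq. (5.4): maximum likelihood
```

Here `mean_null` equals 1.0 while `mean_uniform` is only about 0.049, even though every observed transition went to state 1.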

5.2 Uncertainties in Computed Properties

In practice, one is often not primarily interested in the uncertainties of the transition matrix elements themselves, but rather in the uncertainties in properties computed from the transition matrix. Here, we review two different approaches for this purpose.

  • Linear error perturbation [4, 12, 13]. Here, the transition matrix posterior distribution is approximated by a multivariate Gaussian, and the property of interest—taken to be a function of the transition matrix or its eigenvalues and eigenvectors—is approximated by a first-order Taylor expansion about the center of this Gaussian. This results in a Gaussian distribution of the property of interest, with a mean and a covariance matrix that can be computed in terms of the count matrix C. This approach has the advantage that error estimates, and their rates of reduction for different sampling strategies, can be computed through a direct procedure. As a result, it is convenient for situations where uncertainty estimates are used as part of an adaptive sampling procedure [4, 8, 9, 13]. The disadvantage of this approach is that the Gaussian approximation of the transition matrix posterior is only asymptotically correct, and can easily break down when few counts have been observed. In the low-data regime, the resulting Gaussian distribution for the property of interest often gives substantial probability to unphysical or meaningless values, for example by allowing transition matrix elements \(T_{ij}\) to assume values outside the range [0,1]. Moreover, the property of interest is approximated linearly, which can introduce a significant error when this property is nonlinear.

  • Markov chain Monte Carlo (MCMC) sampling of transition matrices [2, 6, 7]. Here, transition matrices are sampled from the posterior distribution, and the property of interest is computed for each of these and stored as samples from the posterior distribution of the property. This approach requires that the sampling procedure be run sufficiently long that good estimates of standard deviations or confidence intervals of the posterior distribution of the property of interest can be computed, which may be time-consuming. The advantage of this approach is that no assumptions are made concerning the functional form of the distribution or the property being computed. Furthermore, this approach can be straightforwardly applied to any function or property of transition matrices, including complex properties such as transition path distributions [10] without deriving the expressions necessary for the linear error perturbation analysis—often a cumbersome task. However, for large state spaces, the transition matrix T may grow so large as to make this procedure impractical.

5.3 Linear Error Propagation

We start again with the posterior distribution of row-stochastic transition matrices without the detailed balance constraint, given by Eq. (4.10). Defining a new matrix U,

$$ \mathbf{U}=[u_{ij}]=[c_{ij}+1], $$
(5.6)

and using the fact that the posterior probability \(p(\mathbf{T}\mid\mathbf{C}^{\mathrm{obs}})\) implicitly contains the prior probabilities, Eq. (4.10) can be rewritten as:

$$ p(\mathbf{T}\mid{\mathbf{C}}) = p\bigl(\mathbf{T}\bigm|\mathbf{C}^{\mathrm{obs}} \bigr)\propto\prod_{i}\prod _{j}T_{ij}^{u_{ij}-1} $$
(5.7)

such that

$$ \mathbf{T}_{i*} \sim\mathrm{Dir} (\mathbf{u}_{i*} ) $$
(5.8)

where Dir(α) denotes the Dirichlet distribution, and θ∼Dir(α) implies that θ is drawn from the distribution

$$ p(\boldsymbol{\theta}) \propto\prod_i \theta_i^{\alpha_i - 1}. $$
(5.9)

Based on well-established properties of this distribution, and using the abbreviation \(u_{i}=\sum_{j}u_{ij}\), the moments of \(p(\mathbf{T}\mid\mathbf{C})\) can be directly computed,

$$\begin{aligned} \bigl[\mathbb{E}(\mathbf{T})\bigr]_{ij} = & \frac{u_{ij}}{u_{i}}= \frac {c_{ij}+1}{c_{i}+n} = \bar{T}_{ij}, \\ \bigl(\arg\max p(\mathbf{T}|\mathbf{C})\bigr)_{ij} = & \frac {u_{ij}-1}{u_{i}-n}=\frac{c_{ij}}{c_{i}} = \hat{T}_{ij}, \\ \operatorname{Var}(T_{ij}) = & \frac {u_{ij}(u_{i}-u_{ij})}{u_{i}^{2}(u_{i}+1)}\\ =& \frac{\bar{T}_{ij}(1-\bar {T}_{ij})}{u_{i}+1}\\ =&\frac{\bar{T}_{ij}(1-\bar{T}_{ij})}{c_{i}+n+1}, \\ \operatorname{Cov}(T_{ij},T_{ik}) = & \frac {-u_{ij}u_{ik}}{u_{i}^{2}(u_{i}+1)} \quad \forall j\neq k. \end{aligned}$$

Next, we determine how the uncertainties given by the variances and covariances of the transition matrix elements propagate onto uncertainties of functions derived from transition matrices, such as eigenvalues. If we do not have constraints between different rows, such as are imposed by detailed balance, the rows can be treated as independent random vectors, and thus,

$$ \operatorname{Cov} (T_{ij},T_{lk} )=0,\quad i\neq l . $$
(5.10)

We can thus define a covariance matrix Σ (i) separately for each row i as,

$$\begin{aligned} \varSigma_{jk}^{(i)} :=&\operatorname{Cov} (T_{ij},T_{ik} )\\ = & \frac {1}{u_{i}^{2}(u_{i}+1)} [u_{i}\delta_{jk}u_{ij}-u_{ij}u_{ik} ] \\ \approx & \frac{1}{c_i} [\delta_{jk}\bar{T}_{ij}- \bar{T}_{ij}\bar {T}_{ik} ] , \end{aligned}$$

where δ is the Kronecker delta. Alternatively, we can write the covariance matrix Σ (i) in vector notation,

$$\begin{aligned} \boldsymbol{\varSigma}^{(i)} = & \frac{1}{u_{i}^{2}(u_{i}+1)} \bigl[u_{i}\operatorname{diag} (\mathbf{u}_{i*} )- \mathbf{u}_{i*}(\mathbf {u}_{i*})^{T} \bigr] \\ \approx & \frac{1}{c_i} \bigl[\operatorname{diag} (\bar{\mathbf{T}}_{i*} )-\bar {\mathbf{T}}_{i*}(\bar{\mathbf{T}}_{i*})^{T} \bigr] . \end{aligned}$$

In the limit of many observed transition counts, the covariance of each Dirichlet-distributed row scales approximately with the inverse of the total number of counts in that row, \(c_{i}\).
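The row covariance matrix \(\boldsymbol{\varSigma}^{(i)}\) above is straightforward to construct numerically; a minimal sketch with hypothetical counts:

```python
import numpy as np

def row_covariance(u_row):
    """Exact Dirichlet covariance matrix of one transition matrix row,
    with u_row the Dirichlet parameters u_i* = counts + 1."""
    u = np.asarray(u_row, dtype=float)
    u_i = u.sum()
    return (u_i * np.diag(u) - np.outer(u, u)) / (u_i**2 * (u_i + 1.0))

Sigma = row_covariance([5.0, 3.0, 2.0])   # u_i = 10
# each row of Sigma sums to zero, because the T_ij within a row sum to one
```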

With a sufficient number of counts \(c_{i}\) in each row i, the Dirichlet distribution resembles a multivariate Gaussian distribution, and we can approximate it as such using the mean and covariance computed above,

$$ \mathbf{T}_{i*}\sim\mathrm{Normal} \bigl(\hat{\mathbf {T}}_{i*},\boldsymbol{\varSigma}^{(i)} \bigr). $$
(5.11)

This approximate distribution is used in a Gaussian error propagation for linear functions of the transition matrix. Let us assume that we are interested in computing the statistical error of a scalar function \(f(\mathbf{T}) : \mathbb{R}^{n\times n}\rightarrow\mathbb{R}\). The first-order Taylor approximation is given by:

$$f(\mathbf{T})\approx f(\hat{\mathbf{T}})+\sum_{i,j} \frac{\partial f}{\partial T_{ij}}\bigg\vert _{\hat{\mathbf{T}}}(T_{ij}-\hat{T}_{ij}). $$

Since the uncertainties in the rows of T contribute independently to the uncertainty in f, we define a sensitivity vector \(\mathbf{s}^{(i)}\) for each row separately,

$$s_{j}^{(i)}=\frac{\partial f}{\partial T_{ij}}(\hat{\mathbf{T}}) $$

that measures the sensitivity of the scalar function with respect to changes in the transition matrix elements. The point estimate of f is then

$$\hat{f}=f (\hat{\mathbf{T}} ) $$

and Gaussian error propagation yields an approximation for the variance of f,

$$\operatorname{Var} (f )=\operatorname{Cov}(f,f)=\sum _{i} \bigl(\mathbf {s}^{(i)} \bigr)^{T} \boldsymbol{\varSigma}^{(i)}\mathbf{s}^{(i)}. $$

or, more generally, for the covariance between two different scalar functions f and g,

$$\operatorname{Cov}(f,g) = \sum_{i} \bigl( \mathbf{s}[f]^{(i)} \bigr)^{T}\boldsymbol{\varSigma}^{(i)} \mathbf{s}[g]^{(i)}. $$

where \(\mathbf{s}[f]^{(i)}\) and \(\mathbf{s}[g]^{(i)}\) refer to the sensitivities of f and g, respectively. The limitation of this approach is that it does not work well in situations where the transition matrix distribution is far from Gaussian (especially when little data is available). Furthermore, the more nonlinear the function of interest is in terms of the \(T_{ij}\), the larger the error in its estimated uncertainty may become.
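For a linear function the propagation is exact, which gives a simple consistency check. The sketch below (hypothetical 2×2 counts; the scalar function is the trace of T, whose sensitivity vectors are unit vectors) compares the propagated variance against direct sampling from the Dirichlet posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

U = np.array([[80.0, 20.0],
              [10.0, 90.0]])   # u_ij = c_ij + 1 as in Eq. (5.6)

def row_cov(u):
    u_i = u.sum()
    return (u_i * np.diag(u) - np.outer(u, u)) / (u_i**2 * (u_i + 1.0))

# f(T) = T_00 + T_11, so s^(0) = (1, 0) and s^(1) = (0, 1).
S = np.eye(2)
var_lin = sum(S[i] @ row_cov(U[i]) @ S[i] for i in range(2))

# Monte Carlo reference: sample each row from its Dirichlet distribution.
f_samples = rng.dirichlet(U[0], 200000)[:, 0] + rng.dirichlet(U[1], 200000)[:, 1]
var_mc = f_samples.var()
```

Since f is linear here, `var_lin` and `var_mc` agree up to sampling noise; for a nonlinear f, the propagated value is only a first-order approximation.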

5.3.1 Example: Eigenvalues

As an example, we consider the computation of the statistical error in a particular eigenvalue \(\lambda_{k}\) of the transition matrix T using the linear error propagation scheme, closely following the approach described in Refs. [4, 13].

We start from the eigenvalue decomposition of the transition matrix T, omitting the dependence on the lag time τ,

$$ \boldsymbol{\varLambda}=\boldsymbol{\varPhi}\mathbf{T}\boldsymbol{\varPsi} $$
(5.12)

where \(\boldsymbol{\varPsi}=[\boldsymbol{\psi}_{1},\ldots,\boldsymbol{\psi}_{n}]\) is the right eigenvector matrix, \(\boldsymbol{\varPhi}=[\boldsymbol{\phi}_{1},\ldots,\boldsymbol{\phi}_{n}]^{T}=\boldsymbol{\varPsi}^{-1}\) is the left eigenvector matrix, and \(\boldsymbol{\varLambda}=\operatorname{diag}(\lambda_{i})\) is the diagonal matrix of eigenvalues. For the kth eigenvalue-eigenvector pair, we have,

$$\begin{aligned} \lambda^{(k)} = & \bigl(\boldsymbol{\phi}^{(k)} \bigr)^{T}\mathbf{T}\boldsymbol {\psi}^{(k)} = \sum _{i,j}\phi_{i}^{(k)}T_{ij} \psi_{j}^{(k)}. \end{aligned}$$

We wish to compute the statistical error of the eigenvalues λ (k) via linear error perturbation. In general, both the eigenvalues and eigenvectors simultaneously depend on perturbations in the elements of T in a complex way. To first order, the partial derivatives of the eigenvalues with respect to the transition matrix elements are given by products of the left and right eigenvector components,

$$ \frac{\partial\lambda^{(k)}}{\partial T_{ij}}=\phi_{i}^{(k)}\psi_{j}^{(k)}. $$
(5.13)

This expression for the eigenvalue sensitivity may be combined with Eq. (5.11) in order to yield the linear perturbation result,

$$\begin{aligned} \operatorname{Var} \bigl(\lambda^{(k)} \bigr) = & \sum _{i=1}^{n}\sum_{a,b} \frac{\partial\lambda^{(k)}}{\partial T_{ia}}\operatorname{Cov}(T_{ia},T_{ib})\frac{\partial\lambda^{(k)}}{\partial T_{ib}} \\ = & \sum_{i=1}^{n} \bigl(\phi_{i}^{(k)} \bigr)^{2} \biggl(\sum_{a} \bigl(\psi_{a}^{(k)} \bigr)^{2}\frac{u_{ia}(u_{i}-u_{ia})}{u_{i}^{2}(u_{i}+1)}\\ &{}-\sum_{a\neq b}\psi_{a}^{(k)}\psi_{b}^{(k)}\frac{u_{ia}u_{ib}}{u_{i}^{2}(u_{i}+1)} \biggr). \end{aligned}$$
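As a numerical sketch (with hypothetical counts), the eigenvalue sensitivities of Eq. (5.13) can be combined with the Dirichlet row covariances to evaluate this variance; with \(\boldsymbol{\varPhi}=\boldsymbol{\varPsi}^{-1}\), the normalization \(\boldsymbol{\phi}^{(k)}\cdot\boldsymbol{\psi}^{(k)}=1\) holds automatically:

```python
import numpy as np

U = np.array([[80.0, 20.0,  2.0],
              [10.0, 90.0,  2.0],
              [ 3.0,  3.0, 96.0]])
T_hat = U / U.sum(axis=1, keepdims=True)

evals, R = np.linalg.eig(T_hat)       # columns of R are right eigenvectors
L = np.linalg.inv(R)                  # rows of L are left eigenvectors (L @ R = I)
k = np.argsort(evals.real)[-2]        # slowest nontrivial (second-largest) eigenvalue
phi, psi = L[k].real, R[:, k].real

def row_cov(u):
    u_i = u.sum()
    return (u_i * np.diag(u) - np.outer(u, u)) / (u_i**2 * (u_i + 1.0))

# Var(lambda_k): sensitivities phi_i psi_j (Eq. (5.13)) with row covariances
var_lam = sum(phi[i]**2 * (psi @ row_cov(U[i]) @ psi) for i in range(3))
```

A finite-difference perturbation of \(T_{ij}\) reproduces the sensitivity \(\phi_{i}^{(k)}\psi_{j}^{(k)}\) to first order, which provides a quick check of the formula.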

5.4 Sampling Transition Matrices Without Detailed Balance Constraint

In a full Bayesian approach, we sample the posterior distribution,

$$ p(\mathbf{T}\mid\mathbf{C})\propto p(\mathbf{T})p(\mathbf{C}\mid\mathbf {T}) = \prod_{i,j}T_{ij}^{c_{ij}} $$
(5.14)

where we recall that the total count matrix \(\mathbf{C}=\mathbf{C}^{\mathrm{obs}}+\mathbf{C}^{\mathrm{prior}}\), as discussed in Chap. 4, makes the use of different priors straightforward. If the only constraint on T is that it is a stochastic matrix, and we do not require that T fulfills detailed balance, we can view Eq. (5.14) as a product of Dirichlet distributions, one for each row (see Eq. (5.7)). We are then faced with the problem of sampling random variables from the distribution,

$$ \mathbf{T}_{i*} \sim\mathrm{Dir} (\mathbf{u}_{i*} ) . $$
(5.15)

A fast way to generate Dirichlet-distributed random variables is to draw n independent samples \(y_{1},\ldots,y_{n}\) from univariate Gamma distributions, each with density,

$$\begin{aligned} &y_{j}\sim\mathrm{Gamma}(c_{ij}+1,1)=\frac {y_{j}^{c_{ij}}e^{-y_{j}}}{\varGamma(c_{ij}+1)}, \\ &\quad j = 1, \ldots, n , \end{aligned}$$
(5.16)

and then obtain the T ij by normalization of each row,

$$ T_{ij}=\frac{y_{j}}{\sum_{k=1}^{n}y_{k}}. $$
(5.17)

Repeating this procedure independently for every row i=1,…,n will generate a statistically independent sample of T from distribution (5.14).
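The Gamma-based sampling of Eqs. (5.16)–(5.17) takes only a few lines; a minimal sketch (hypothetical counts, using numpy's Gamma generator):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_transition_matrices(C_obs, n_samples):
    """Independent posterior samples of T without the detailed balance constraint.

    Each row is Dirichlet(u_i*) with u_ij = c_ij + 1, realized by drawing
    Gamma(c_ij + 1, 1) variates (Eq. (5.16)) and normalizing rows (Eq. (5.17))."""
    C = np.asarray(C_obs, dtype=float)
    Y = rng.gamma(shape=C + 1.0, scale=1.0, size=(n_samples,) + C.shape)
    return Y / Y.sum(axis=2, keepdims=True)

Ts = sample_transition_matrices([[90, 10], [20, 80]], n_samples=10000)
# the sample mean of T_00 approaches the posterior mean (c_00 + 1)/(c_0 + n)
```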

5.5 Sampling the Reversible Transition Matrix Distribution

No similarly simple approach for the direct generation of statistically independent samples from the distribution (5.14) exists when the transition matrix T is further constrained to fulfill detailed balance. To include the detailed balance constraint, we consider sampling Eq. (5.14) using the Metropolis-Hastings algorithm, in which we propose a change to the transition matrix, \(\mathbf{T}\rightarrow\mathbf{T}'\). This proposal is accepted with probability given by the Metropolis-Hastings criterion,

$$\begin{aligned} p_{\mathrm{acc}} =& \frac{p(\mathbf{T}'\rightarrow\mathbf{T})}{p(\mathbf {T}\rightarrow\mathbf{T}')} \frac{p(\mathbf{T}'|\mathbf{C})}{p(\mathbf {T}|\mathbf{C})} \\ =&\frac{p(\mathbf{T}'\rightarrow\mathbf{T})}{p(\mathbf {T}\rightarrow\mathbf{T}')} \frac{p(\mathbf{C}|\mathbf{T}')}{p(\mathbf {C}|\mathbf{T})} \\ =&\frac{p(\mathbf{T}'\rightarrow\mathbf{T})}{p(\mathbf {T}\rightarrow\mathbf{T}')} \frac{\prod_{i,j}{T'}_{ij}^{c_{ij}}}{\prod_{i,j}T_{ij}^{c_{ij}}}. \end{aligned}$$
(5.18)

This scheme requires efficient proposals \(\mathbf{T}\rightarrow\mathbf{T}'\) that maintain the detailed balance constraint and are likely to be accepted, as well as a method of efficiently computing the ratio of proposal probabilities \(p(\mathbf{T}'\rightarrow\mathbf{T})/p(\mathbf{T}\rightarrow\mathbf{T}')\) for each proposal. Such a scheme was worked out in detail in Ref. [7], and we summarize the resulting method as Algorithm 2.

Algorithm 2

Metropolis Monte Carlo sampling of reversible stochastic matrices
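The exact proposal steps of Algorithm 2 are given in Ref. [7]. As a hedged sketch of the same idea (not the algorithm of Ref. [7] itself; the parametrization and step size are illustrative choices), one can parametrize a reversible T by a symmetric matrix X of virtual counts, so that detailed balance holds by construction, and apply a Metropolis criterion with a symmetric random-walk proposal on one off-diagonal element at a time:

```python
import numpy as np

rng = np.random.default_rng(5)

def log_like(X, C):
    """Log-likelihood of Eq. (5.14) with T_ij = x_ij / x_i, x_i = row sum of X."""
    x_i = X.sum(axis=1)
    return np.sum(C * (np.log(X) - np.log(x_i)[:, None]))

def sample_reversible(C, n_sweeps=200, step=0.5):
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    X = C + C.T + 1.0               # symmetric, strictly positive initial guess
    Ts = []
    for _ in range(n_sweeps):
        for i in range(n):
            for j in range(i + 1, n):
                d = rng.uniform(-step, step)      # symmetric proposal
                if X[i, j] + d <= 0.0:
                    continue                      # reject moves leaving the domain
                Xp = X.copy()
                Xp[i, j] += d
                Xp[j, i] += d                     # keep X symmetric -> T reversible
                if np.log(rng.random()) < log_like(Xp, C) - log_like(X, C):
                    X = Xp
        Ts.append(X / X.sum(axis=1, keepdims=True))
    return np.array(Ts)

Ts = sample_reversible([[90.0, 10.0,  0.0],
                        [ 5.0, 80.0, 15.0],
                        [ 0.0, 15.0, 85.0]])
```

Every sampled matrix is row-stochastic and fulfills detailed balance with respect to its own stationary distribution \(\pi_{i}\propto x_{i}\), by construction of the symmetric parametrization.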

Example 1

Every 2×2 transition matrix is reversible. To see this, we can compute the stationary distribution from the dominant eigenvector,

$$ \boldsymbol{\pi}= \biggl(\frac{T_{21}}{T_{12}+T_{21}},\frac {T_{12}}{T_{12}+T_{21}} \biggr) , $$
(5.19)

from which we can see that detailed balance is always fulfilled,

$$ \pi_1 T_{12} = \frac{T_{21}}{T_{12}+T_{21}} T_{12} = \frac {T_{12}}{T_{12}+T_{21}} T_{21} = \pi_2 T_{21} . $$
(5.20)

Indeed, for 2×2 matrices the nonreversible transition matrix sampling scheme (Sect. 5.4) generates the same distribution as the reversible transition matrix sampling scheme in Algorithm 2. See Fig. 5.1B for an illustration of this sampling scheme applied to a 2×2 matrix.
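The claim of Example 1 can be verified numerically in a few lines (the 2×2 stochastic matrix below is arbitrary, chosen only for illustration):

```python
import numpy as np

T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# stationary distribution from Eq. (5.19)
pi = np.array([T[1, 0], T[0, 1]]) / (T[0, 1] + T[1, 0])

stationary = pi @ T            # pi is invariant under T
flux_12 = pi[0] * T[0, 1]      # pi_1 T_12
flux_21 = pi[1] * T[1, 0]      # pi_2 T_21: equal by Eq. (5.20)
```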

Fig. 5.1

Illustration of sampling of transition probability matrices for the observation and a uniform prior. Panels (a), (b), and (c) show the probability distribution on the off-diagonal matrix elements. The color encodes the probability density, with blue=0 and red=1. Each density was scaled such that its maximum is equal to 1. (a) Analytic density of stochastic matrices. (b) Sampled density of stochastic matrices (these matrices automatically fulfill detailed balance). (c) Stationary probability of the first state π 1. When sampling with respect to a fixed stationary probability distribution π , the ensemble is fixed to the line \(T_{21} = T_{12} \pi^{*}_{1}/(1-\pi^{*}_{1})\). (d) Sampled and exact density of T 12 of reversible matrices with fixed stationary distribution π =(0.5,0.5)

Example 2

Figure 5.2 illustrates how the distribution of a 3×3 transition matrix differs between the nonreversible (panels B, E, H) and reversible (panels C, F, I) cases. For the matrix studied here, the distribution of reversible matrices is slightly narrower.

Fig. 5.2

Visualization of the probability density of transition matrices for the count matrix and a uniform prior. Different two-dimensional joint marginal distributions are shown in the rows. The analytic and sampled distributions for stochastic matrices are shown in columns 1 and 2, respectively. Column 3 shows the sampled distribution for stochastic matrices fulfilling detailed balance. Note how the distributions are more sharply peaked when the detailed balance constraint is imposed (column 3) compared to the corresponding transition matrices without this constraint (column 2)

5.5.1 Sampling with Fixed Stationary Distribution

In some cases, the stationary distribution, π, may be known exactly or to within very small statistical error. For example, an efficient equilibrium simulation scheme (such as parallel tempering or metadynamics) or a Monte Carlo method may have generated a very precise estimate of π by simulating a perturbed system or one with unphysical dynamics. It may be useful to incorporate this information about π when inferring the posterior distribution of transition matrices, since it may significantly reduce the uncertainty.

To do this, we first note that Algorithm 2 above employs two types of Monte Carlo proposals for sampling reversible transition matrices: one type of proposal (the reversible element shift) changes π, while the other (the node shift) preserves π. This suggests a straightforward modification of the T-sampling algorithm that ensures π is constrained to a specified value during the sampling procedure.

We first give an algorithm to construct an initial transition matrix \(\mathbf{T}^{(0)}\) with a specified stationary distribution π from a given count matrix C (Algorithm 3), and then use this to initialize a Monte Carlo transition matrix sampling algorithm that preserves the stationary distribution (Algorithm 4).

Algorithm 3

Generation of an initial transition matrix \(\mathbf{T}^{(0)}\) given count matrix C and a specified stationary distribution π

Algorithm 4

Metropolis-Hastings Monte Carlo sampling of reversible stochastic matrices with probability distribution of stationary distributions p(π)

5.6 Full Bayesian Approach with Uncertainty in the Observables

Suppose we are interested in some experimentally measurable function of state A(x). An experiment may be able to measure an expectation 〈A〉 or correlation functions 〈A(0)A(t)〉, and we would like to compute the corresponding properties from the Markov model constructed from a molecular simulation and decide whether they agree with experiment to within statistical uncertainty, or if a prediction from the model is sufficiently precise to be useful. The previous framework for sampling transition matrices can be used in the following manner: (i) Assign the state-averaged value of the observable, \(a_{i}=\int_{S_{i}}d\mathbf{x}\, \mu(\mathbf{x}) A(\mathbf{x})\), to each discrete state. (ii) Generate an ensemble of T-matrices according to the sampling scheme described above. (iii) Calculate the desired expectation or correlation function for each T-matrix using the discrete vector \(\mathbf{a}=[a_{i}]\). This approach involves several approximations that each deserve discussion. Here, we want to generalize the approach by eliminating one important approximation—that the values \(a_{i}\) are known exactly, without statistical error themselves.

In a typical simulation scenario, the average \(a_{i}\) is itself calculated from a statistical sample. When a simulation trajectory \(\mathbf{x}_{t}\) is available, then typically the time average

$$ \hat{a}_{i}=\frac{\sum_{t}\chi_{i}(\mathbf{x}_{t}) A(\mathbf {x}_{t})}{\sum_{t}\chi_{i}(\mathbf{x}_{t})} $$
(5.21)

is employed, where \(\chi_{i}\) is the indicator function of state i. The estimate \(\hat{a}_{i}\) may in fact have significant statistical error because the number of uncorrelated samples of \(\mathbf{x}_{t}\) inside any state i is finite, and possibly rather small. In order to estimate the distribution of expectation or correlation functions of A due to both the statistical uncertainty of T and the statistical uncertainty of \(\hat{a}_{i}\), we propose a full Bayesian approach using a Gibbs sampling scheme, here illustrated for the expectation \(\mathbb{E}[A]\) (Algorithm 5).

Algorithm 5

Gibbs sampler for the joint estimation of \(p(\mathbb{E}[A])\)

While the transition matrix \(\mathbf{T}^{(k)}\) can be sampled using the framework described in the previous sections, an approach for sampling \(\mathbf{a}^{(k)}\), introduced in Ref. [2], is described subsequently.
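A hedged sketch of such a Gibbs scheme follows (this is not the exact Algorithm 5; the function and variable names are hypothetical, the transition matrix is sampled without the detailed balance constraint for simplicity, and the state means are drawn with the two-step scheme derived in Sect. 5.6.1):

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_expectation_samples(C_obs, obs_per_state, n_iter=2000):
    """Alternately draw T^(k) and a^(k), recording E[A]^(k) = sum_i pi_i a_i."""
    C = np.asarray(C_obs, dtype=float)
    samples = []
    for _ in range(n_iter):
        # (1) draw T^(k): independent Dirichlet rows via Gamma variates
        Y = rng.gamma(C + 1.0, 1.0)
        T = Y / Y.sum(axis=1, keepdims=True)
        # stationary distribution of T^(k) from the dominant left eigenvector
        w, V = np.linalg.eig(T.T)
        pi = np.abs(V[:, np.argmax(w.real)].real)
        pi /= pi.sum()
        # (2) draw a^(k): sigma^2 from a scaled inverse-chi-square, then
        #     mu from its conditional normal, as in Eq. (5.36)
        a = []
        for A_m in obs_per_state:
            N, mu_hat = len(A_m), np.mean(A_m)
            s2 = np.var(A_m, ddof=1)
            sigma2 = (N - 1) * s2 / rng.chisquare(N - 1)
            a.append(rng.normal(mu_hat, np.sqrt(sigma2 / N)))
        samples.append(pi @ np.array(a))
    return np.array(samples)

# toy example: symmetric counts, sharply peaked observables in each state
samples = gibbs_expectation_samples(
    [[90, 10], [10, 90]],
    [np.full(30, 1.0), np.full(30, 3.0)])
```

Each recorded sample reflects the combined uncertainty from the transition matrix and the state means; summarizing `samples` by its standard deviation or percentiles yields error bars on \(\mathbb{E}[A]\).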

5.6.1 Sampling State Expectations \(\mathbf{a}^{(k)}\)

Consider the expectation of some molecular observable A(x) computed from Eq. (5.21). Temporally sequential samples \(A_{t}\equiv A(\mathbf{x}_{t})\), collected with a temporal resolution of the Markov time τ, are presumed to be uncorrelated. We also assume that the samples \(A(\mathbf{x}_{t})\) for those configurations \(\mathbf{x}_{t}\) appearing in state i are collected in the set \(\{A_{m}\}_{m=1}^{N}\) for the remainder of this section, generally abbreviated as \(\{A_{m}\}\).

Because only a finite number of samples N are collected for each state, there will be a degree of uncertainty in this estimate. Unlike the problem of inferring the transition matrix elements, however, we cannot write an exact expression for the probability of observing a single sample A m in terms of a simple parametric form, since its probability distribution may be arbitrarily complex,

$$ p_{i}(A_{m}) = \frac{1}{\pi_{i}}\int _{S_{i}} d\mathbf{x} \, \delta \bigl(A_{m}-A( \mathbf{x})\bigr) \mu(\mathbf{x}) . $$
(5.22)

Despite this, the central limit theorem states that the distribution of \(\hat{a}_{i}\) approaches a normal distribution (generally very rapidly) as the number of samples N increases. We will therefore make the assumption that \(p_{i}(A_{m})\) is normal—that is, we assume the distribution can be characterized by a mean \(\mu_{i}\) and variance \(\sigma_{i}^{2}\),

$$ A_{m} \sim\mathrm{Normal}\bigl(\mu_i, \sigma_i^2\bigr) $$
(5.23)

where the normal distribution implies the probability density for A m is approximated by

$$\begin{aligned} &\tilde{p}_{i}\bigl(A_{m};\mu_{i}, \sigma_{i}^{2}\bigr) \\ &\quad= (2\pi)^{-1/2}\sigma _{i}^{-1}\exp \biggl[-\frac{1}{2\sigma_{i}^{2}}(A_{m}- \mu_{i})^{2} \biggr] . \end{aligned}$$
(5.24)

While this may seem like a drastic assumption, it turns out this approximation allows us to do a surprisingly good job of inferring the distribution of the error in \(\delta\hat{a}_{i}\equiv\hat{a}_{i}-\langle A\rangle_{i}\) even for a small number of samples from each state, and it generally gives an overestimate of the error (which is arguably less dangerous than an underestimate) for smaller sample sizes. The validity of this approximation is illustrated in a subsequent example; below, we first develop its ramifications.

Consider the sample mean estimator for \(\langle A\rangle_{i}\),

$$\begin{aligned} \hat{\mu} = & \frac{1}{N}\sum_{m=1}^{N}A_{m} . \end{aligned}$$
(5.25)

The asymptotic variance of \(\hat{\mu}\), which provides a good estimate of the statistical uncertainty in \(\hat{\mu}\) in the large-sample limit, is given as a simple consequence of the central limit theorem,

$$\begin{aligned} \delta^{2}\hat{\mu} \equiv&\mathbb{E} \bigl[\bigl(\hat{\mu}-\mathbb{E}[ \hat{\mu }]\bigr)^{2} \bigr] \\ =&\frac{\operatorname{Var}A_{m}}{N}\approx\frac{\hat{\sigma}^{2}}{N} \end{aligned}$$
(5.26)

where the unbiased estimator for the variance \(\sigma^{2}\equiv\operatorname{Var}A_{m}\) is given by

$$\begin{aligned} \hat{\sigma}^{2} \equiv& \frac{1}{N-1}\sum _{m=1}^{N}(A_{m}-\hat{\mu})^{2} \end{aligned}$$
(5.27)

Suppose we now assume the distribution of A from state i is normal (Eq. (5.24)),

$$\begin{aligned} A | \mu,\sigma^{2} \sim& \mathrm{Normal}\bigl(\mu, \sigma^{2}\bigr) . \end{aligned}$$
(5.28)

If this is a reasonable model, we can model the time series of the observable \(A_{t}\equiv A(\mathbf{x}_{t})\) by the hierarchical process:

$$\begin{aligned} \begin{aligned} s_{t} | s_{t-1},\mathbf{T} & \sim \mathrm{Bernoulli}(T_{s_{t-1} 1},\ldots,T_{s_{t-1} N}), \\ A_{t} | \mu_{s_{t}},\sigma_{s_{t}}^{2} & \sim \mathrm{Normal}\bigl(\mu _{s_{t}},\sigma_{s_{t}}^{2} \bigr). \end{aligned} \end{aligned}$$
(5.29)

Here, the notation \(\mathrm{Bernoulli}(\pi_{1},\ldots,\pi_{N})\) denotes a Bernoulli scheme in which discrete outcome n has associated probability \(\pi_{n}\) of being selected. We will demonstrate below how this model does in fact recapitulate the expected behavior in the limit where there are sufficient samples from each state.

We choose the (improper) Jeffreys prior [5],

$$\begin{aligned} p\bigl(\mu,\sigma^{2}\bigr) \propto& \sigma^{-2} \end{aligned}$$
(5.30)

because it satisfies intuitively reasonable reparameterization [5] and information-theoretic [3] invariance principles. Note that this prior is uniform in (μ,logσ).

The posterior is then given by

$$\begin{aligned} &p\bigl(\mu,\sigma^{2}\bigm|\{A_{m}\}\bigr) \\ &\quad\propto \Biggl[ \prod_{m=1}^{N}p\bigl(A_{m}\bigm|\mu, \sigma^{2}\bigr) \Biggr] p\bigl(\mu,\sigma^{2}\bigr) \\ &\quad \propto \sigma^{-(N+2)} \exp \Biggl[-\frac{1}{2\sigma^{2}}\sum _{m=1}^{N}(A_{m}-\mu)^{2} \Biggr] . \end{aligned}$$
(5.31)

Rewriting in terms of the sample statistics \(\hat{\mu}\) and \(\hat{\sigma}^{2}\), we obtain

$$\begin{aligned} &p\bigl(\mu,\sigma^{2}\bigm|\{A_{m}\}\bigr) \\ &\quad \propto \sigma^{-(N+2)} \exp \Biggl\{ -\frac{1}{2\sigma^{2}} \Biggl[ \sum_{m=1}^{N}(A_{m}-\hat{ \mu})^{2} \\ &\qquad{}+N(\hat{\mu}-\mu)^{2} \Biggr] \Biggr\} \\ &\quad \propto \sigma^{-(N+2)} \exp \biggl\{ -\frac{1}{2\sigma^{2}} \bigl[(N-1)\hat{\sigma}^{2} \\ &\qquad{}+N(\hat{\mu}-\mu)^{2} \bigr] \biggr\} . \end{aligned}$$
(5.32)

The posterior has marginal distributions

$$\begin{aligned} \begin{aligned} \sigma^{2} | \{A_{m}\} & \sim \mathrm{Inv-} \chi^{2}\bigl(N-1,\hat{\sigma }^{2}\bigr), \\ \mu | \{A_{m}\} & \sim \mathrm{t}_{N-1}\bigl(\hat{\mu},\hat{ \sigma}^{2}/N\bigr) \end{aligned} \end{aligned}$$
(5.33)

where \(\sigma^{2}\) is distributed according to a scaled inverse chi-square distribution with N−1 degrees of freedom, and μ according to a Student t-distribution with N−1 degrees of freedom that has been shifted to be centered about \(\hat{\mu}\) and scaled in width by \(\hat{\sigma}^{2}/N\).

As can be seen in Fig. 5.3, as the number of degrees of freedom increases, the marginal posterior for μ approaches the normal distribution expected from standard frequentist analysis for the standard error of the mean, namely

$$\begin{aligned} \mu\rightarrow\mathrm{N}\bigl(\hat{\mu},\hat{\sigma}^{2}/N\bigr) . \end{aligned}$$
(5.34)

At low sample counts, the t-distribution is lower and wider than the normal distribution, meaning that confidence intervals computed from it will be somewhat larger than those of the corresponding normal estimate for small samples. In some sense, this partly compensates for \(\hat{\sigma}^{2}\) being a poor estimate of the true variance at small sample sizes, which would otherwise lead to underestimates of the statistical uncertainty. In any case, the small-sample regime is far from the asymptotic limit, where the normal distribution with variance \(\hat{\sigma}^{2}/N\) is expected to model the uncertainty well.

Fig. 5.3

Approach to normality for marginal distribution of the mean p(μ|{A m }). For fixed \(\hat{\mu}\) and \(\hat{\sigma}^{2}\), the marginal posterior distribution of μ (red), a scaled and shifted Student t-distribution, rapidly approaches the normal distribution (black) expected from asymptotic statistics. The PDF is shown for sample sizes of N=5 (the broadest), 10, 20, and 30

The posterior can also be decomposed as

$$\begin{aligned} &p\bigl(\mu,\sigma^{2}\bigm|\{A_{m}\}\bigr) \\ &\quad = p\bigl(\mu \bigm|\sigma^{2},\{A_{m}\}\bigr) p\bigl(\sigma ^{2} \bigm|\{A_{m}\}\bigr). \end{aligned}$$
(5.35)

This readily suggests a two-step sampling scheme for generating uncorrelated samples of \((\mu,\sigma^{2})\), in which we first sample \(\sigma^{2}\) from its marginal distribution, and then μ from its distribution conditional on \(\sigma^{2}\),

$$\begin{aligned} \begin{aligned} \sigma^{2} | \{A_{m}\} & \sim \mathrm{Inv-} \chi^{2}\bigl(N-1,\hat{\sigma }^{2}\bigr), \\ \mu | \sigma^{2}, \{A_{m}\} & \sim \mathrm{N}\bigl(\hat{ \mu},\sigma^{2}/N\bigr). \end{aligned} \end{aligned}$$
(5.36)

Alternatively, if the scaled inverse-chi-square distribution is not available, the \(\chi^{2}\)-distribution (among others) can be used to sample \(\sigma^{2}\):

$$ (N-1) \bigl(\hat{\sigma}^{2}/\sigma^{2}\bigr) \bigm| \{A_{m}\} \sim \chi^{2}(N-1) $$
(5.37)

where, in the notation \(\mathrm{Inv\mbox{-}}\chi^{2}(N-1,\hat{\sigma}^{2})\), the first argument is the number of degrees of freedom and the second argument is the scale parameter.
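In numpy, for example, the two-step scheme can be realized with the ordinary \(\chi^{2}\) generator (a sketch; the sample statistics below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

N, mu_hat, s2_hat = 10, 0.0, 1.0     # sample size and sample statistics

# sigma^2 ~ scaled Inv-chi^2(N-1, s2_hat), realized via Eq. (5.37)
sigma2 = (N - 1) * s2_hat / rng.chisquare(N - 1, size=200000)
# mu | sigma^2 ~ Normal(mu_hat, sigma^2 / N), as in Eq. (5.36)
mu = rng.normal(mu_hat, np.sqrt(sigma2 / N))

# marginally, mu follows the t-distribution of Eq. (5.33), with
# Var(mu) = (s2_hat / N) * (N - 1) / (N - 3)
```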

5.6.2 Illustration of Fully Bayesian Sampling Scheme

Using the sampling procedures described previously, we are now equipped with a scheme to sample from the joint posterior describing our confidence that a Markov model, characterized by a transition matrix T and state expectations \(\mu_{i}\), i=1,…,M, produced the observed trajectory data. Using a set of models sampled from this posterior, we can characterize the statistical component of the uncertainty as it propagates into equilibrium averages, non-equilibrium relaxations, and (non-)equilibrium correlation measurements computed from the Markov model. To ensure the correctness of this procedure, however, we first test its ability to correctly characterize the posterior distribution for a finite-size sample from a true Markovian model system.

How can we test a Bayesian posterior distribution? One of the more powerful features of a Bayesian model is its ability to provide confidence intervals that correctly reflect the level of certainty that the true value lies within them. For example, if the experiment were repeated many times, the true value of the parameter being estimated should fall within the confidence interval for a 95 % confidence level 95 % of the time. As an illustrative example, consider a biased coin where the probability of turning up heads is θ. From an observed sample of N coin flips, we can estimate θ using a Binomial model for the number of coin flips that turn up heads and a conjugate Beta Jeffreys prior [3, 5]. Each time we run an experiment and generate a new independent collection of N samples, we get a different posterior estimate for θ, and a different confidence interval (Fig. 5.4, top). If we run many trials and record what fraction of the time the true (unknown) value of θ falls within the confidence interval estimated from that trial, we can see if our model is correct. If correct, the observed confidence level should match the desired confidence level (Fig. 5.4, bottom right). Deviation from parity means that the posterior is either too broad or too narrow, and that the statistical uncertainty is being either over- or underestimated (Fig. 5.4, bottom left).

Fig. 5.4

Testing the posterior for inference of a biased coin flip experiment. Top: Posterior distribution for inferring the probability of heads, θ, for a biased coin from a sequence of N=1000 coin flips (dark line) with 95 % symmetric confidence interval about the mean (shaded area). The true probability of heads is 0.3 (vertical thick line). Posteriors from five different experiments are shown as dotted lines. Bottom left: Desired and actual confidence levels for an idealized normal posterior distribution that either overestimates (upper left curves) or underestimates (lower right curves) the true posterior variance by different degrees. Bottom right: Desired and actual confidence levels for the Binomial-Beta posterior for the coin flip problem depicted in the upper panel. Error bars show 95 % confidence interval estimates from 1000 independent experimental trials. For inference, we use the likelihood \(N_{H}\mid\theta\sim\mathrm{Binomial}(N,\theta)\) for the observed number of heads and the conjugate Jeffreys prior [3, 5] \(\theta\sim\mathrm{Beta}(1/2,1/2)\), which produces the posterior \(\theta\mid N_{H}\sim\mathrm{Beta}(N_{H}+1/2,N_{T}+1/2)\), with the constraint \(N_{H}+N_{T}=N\)

We performed a similar test on a three-state model system, using a reversible, row-stochastic model transition matrix for one Markov time step given by

$$ \mathbf{T}(1) = \left [ \begin{array}{c@{\quad}c@{\quad}c} 0.86207 & 0.12931 & 0.00862\\ 0.15625 & 0.83333 & 0.01041\\ 0.00199 & 0.00199 & 0.99602 \end{array} \right ]. $$
(5.38)

Each state is characterized by a mean value of the observable A(x), fixed to 3, 2, and 1 for the first, second, and third states, respectively. The equilibrium populations are π≈[0.1625, 0.1345, 0.7031]. Simulation from this model involves a stochastic transition according to the transition elements T ij , followed by observation of a value of A(x) sampled i.i.d. from the current state’s probability distribution p i (A). Multiple independent realizations of this process were carried out and subjected to the Bayesian inference procedure for transition matrices and observables described above. The nonequilibrium relaxation \(\langle A\rangle_{\rho_{0}}\) from the initial condition ρ 0=[1, 0, 0], in which all density is concentrated in state 1, as well as the autocorrelation function 〈A(0)A(t)〉, is shown in Fig. 5.5.
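The simulation protocol just described can be sketched as follows. This is an illustration under stated assumptions (NumPy, an arbitrary seed, and row renormalization to absorb the rounding of the published matrix entries), not the authors' implementation:

```python
import numpy as np

# Transition matrix of Eq. (5.38) and per-state means of the observable A
T = np.array([[0.86207, 0.12931, 0.00862],
              [0.15625, 0.83333, 0.01041],
              [0.00199, 0.00199, 0.99602]])
mu = np.array([3.0, 2.0, 1.0])
rng = np.random.default_rng(0)

# Renormalize rows, since the published entries are rounded to 5 digits
P = T / T.sum(axis=1, keepdims=True)

def simulate(n_steps, s0=0):
    """Sample a state trajectory, then draw A i.i.d. from each visited state."""
    cum = np.cumsum(P, axis=1)
    states = np.empty(n_steps, dtype=int)
    s = s0
    for t in range(n_steps):
        states[t] = s
        # inverse-CDF sampling of the next state (clamped against roundoff)
        s = min(int(np.searchsorted(cum[s], rng.random())), P.shape[0] - 1)
    # normal output distribution with unit variance around the state mean
    A = rng.normal(mu[states], 1.0)
    return states, A

# Exact nonequilibrium relaxation from rho0 = [1, 0, 0]: <A(t)> = rho0 T^t mu
rho0 = np.array([1.0, 0.0, 0.0])
relax = [rho0 @ np.linalg.matrix_power(P, t) @ mu for t in range(51)]
```

The exact curve `relax` starts at the state-1 mean, A=3, and decays toward the equilibrium value ⟨A⟩=π·μ≈1.46, matching the relaxation shown in Fig. 5.5.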

Fig. 5.5

Observables for three-state model system. Top: Relaxation of \(\langle A(t)\rangle_{\rho_{0}}\) (solid line) from initial distribution ρ 0=[1, 0, 0] to the equilibrium expectation 〈A〉 (dash-dotted line). Bottom: Equilibrium autocorrelation function 〈A(0)A(t)〉 (solid line) decaying to \(\langle A\rangle^{2}\) (dash-dotted line). The estimates of both \(\langle A(t)\rangle_{\rho_{0}}\) and 〈A(0)A(t)〉 at 50 timesteps (red vertical line) were assessed in the validation tests described here

With the means of p i (A) within each state fixed as above, we considered models for p i (A) that were either normal or exponential, using the probability density functions:

$$\begin{aligned} &p_{i}(A) = \bigl(2\pi\sigma_{i}^{2}\bigr)^{-1/2} \exp \biggl[-\frac{(A-\mu_{i})^{2}}{2\sigma_{i}^{2}} \biggr] \quad \mathrm{(normal)},\\ &p_{i}(A) = \mu_{i}^{-1}\exp[-A/\mu_{i}] ,\quad A\ge0 \quad \mathrm{(exponential)}. \end{aligned}$$

While the normal output distribution for p i (A) corresponds to the hierarchical Bayesian model that forms the basis for our approach, the exponential distribution is significantly different, and represents a challenging test case.

Figure 5.6 depicts the resulting uncertainty estimates for both normal (top) and exponential (bottom) densities for the observable A. In both cases, the confidence intervals are underestimated for short trajectory lengths (1 000 steps), where, in many realizations, few samples are observed in one or more states, so that the variance is underestimated or the effective asymptotic limit has not yet been reached. As the simulation length is increased to 10 000 or 100 000 steps, making it much more likely that each state collects enough samples to reach the asymptotic limit, the confidence intervals predicted by the Bayesian posterior become quite accurate. For the exponential model for observing values of A (which might arise in, say, fluorescence lifetimes), we observe similar behavior. Except for what appears to be a slight, consistent underestimation of \(\langle A(t)\rangle_{\rho_{0}}\) (much less than half a standard deviation), there is excellent agreement between the expected and observed confidence intervals, confirming that this method is a useful approach to modeling statistical uncertainties in equilibrium and kinetic observables.
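A stripped-down version of this confidence-interval test, restricted to the equilibrium average ⟨A⟩ and using a simple row-wise Dirichlet posterior in place of the full reversible sampler, might look like the following. The trajectory lengths, trial counts, and Jeffreys-like prior pseudocount of 1/2 are illustrative assumptions, not the chapter's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(3)
T = np.array([[0.86207, 0.12931, 0.00862],
              [0.15625, 0.83333, 0.01041],
              [0.00199, 0.00199, 0.99602]])
P = T / T.sum(axis=1, keepdims=True)   # renormalize rounded rows
mu = np.array([3.0, 2.0, 1.0])

def stationary(M):
    """Stationary distribution: left eigenvector for the largest eigenvalue."""
    evals, evecs = np.linalg.eig(M.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    return pi / pi.sum()

A_true = stationary(P) @ mu            # true equilibrium <A>

def one_trial(n_steps=5000, n_post=400):
    # simulate a trajectory and accumulate the transition count matrix
    cum = np.cumsum(P, axis=1)
    C = np.zeros((3, 3))
    s = 0
    for _ in range(n_steps):
        s_next = min(int(np.searchsorted(cum[s], rng.random())), 2)
        C[s, s_next] += 1
        s = s_next
    # sample transition matrices row-wise from Dirichlet(C_ij + 1/2)
    est = np.empty(n_post)
    for k in range(n_post):
        Ts = np.vstack([rng.dirichlet(C[i] + 0.5) for i in range(3)])
        est[k] = stationary(Ts) @ mu
    lo, hi = np.quantile(est, [0.025, 0.975])
    return lo <= A_true <= hi

coverage = np.mean([one_trial() for _ in range(30)])
print(coverage)  # for a calibrated posterior, near the nominal 0.95
```

Repeating this with shorter trajectories should reproduce the undercoverage seen in the left panels of Fig. 5.6, since rarely visited states then dominate the error in the sampled stationary distributions.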

Fig. 5.6

Confidence interval tests for model system. Top: Expected and observed confidence intervals for the three-state system with a normal distribution of unit variance for observable A, for simulations of length 1 000 (left), 10 000 (middle), and 100 000 (right) steps. Confidence intervals were estimated from 10 000 samples generated from the Bayesian posterior. The fraction of times the true value fell within the confidence interval estimated from the Bayesian posterior was computed from 1 000 independent experimental realizations. The resulting curves are shown for the equilibrium estimate 〈A〉 (red), the nonequilibrium relaxation \(\langle A\rangle_{\rho_{0}}\) (green), and the equilibrium correlation function 〈A(0)A(t)〉 (blue). Bottom: Same as top, except that an exponential distribution with the same mean was used for the probability of observing a particular value of A within each state