1 Introduction

We consider the inference problem of a partially observed continuous-time Markov chain (CTMC), written as \(X:=\{X(t);\;t \le \tau \}\), taking values on a finite state space \({\mathbb {S}}:=\{1,\ldots ,m\}\). Such continuous-time discrete-state systems find applications in areas such as physics, Van Kampen (2007); ecology, Fukaya and Royle (2013); neuroscience, Sauer (2016); and finance, Pardoux (2008). Hence, efficient inference procedures are required. The literature on CTMCs is extensive, with an excellent exposition provided in, for example, the monograph by Norris (1998).

If the process is observed in full, that is, all moves and the times of moves between states are observed, then the likelihood is easy to derive, evaluate and maximize. The time the process spends in each state \(j\in {\mathbb {S}}\) is exponential with parameter \(\lambda _j>0\). After this time the process moves from state j to state k with probability \(p_{j,k}\), noting that \(p_{j,j}=0\). For convenience we write the rates \((\lambda _j)\) in matrix form as \(\Lambda =\text{ diag }(\lambda _1,\ldots ,\lambda _m)\), an \(m\times m\) matrix, and we let \(P=(p_{j,k})\) be the \(m\times m\) matrix of transition probabilities. The process can then be characterized by the intensity matrix \(G=-\Lambda + \Lambda P\).
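As a concrete illustration, a minimal sketch in Julia (the language of our implementation; the numbers are illustrative only) of building G from \(\Lambda \) and P:

```julia
using LinearAlgebra

# Illustrative holding rates and jump probabilities for m = 3 states;
# note the zero diagonal of P and that each row of P sums to one.
λ = [1.0, 2.0, 0.5]
P = [0.0 0.7 0.3;
     0.5 0.0 0.5;
     0.9 0.1 0.0]

# Intensity (generator) matrix G = -Λ + ΛP; its rows sum to zero.
G = -Diagonal(λ) + Diagonal(λ) * P
sum(G, dims=2)   # ≈ [0.0, 0.0, 0.0]
```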

With a fully observed process the likelihood can be written as

$$\begin{aligned} {\mathcal {L}}(G)= \left( \prod _{j\ne k}p_{j,k}^{N_{j,k}}\right) \left( \prod _{j=1}^m \lambda _j^{N_j}\,e^{-T_j\,\lambda _j}\right) , \end{aligned}$$
(1)

where \(N_{j,k}\) denotes the number of transitions from state j to state k, \(N_j\) is the number of exits from state j and \(T_j\) is the total time spent in state j, up to the final time for which the process is observed. In this case, the maximum likelihood estimator is readily obtained, see e.g. Inamura (2006),

$$\begin{aligned} {\widehat{\lambda }}_j = \frac{N_j}{T_j},&\quad \text{ and } \quad {\widehat{p}}_{j,k} = \frac{N_{j,k}}{N_j}, \quad j,k\in \left\{ 1, \ldots , m \right\} ,\,\, j \ne k. \end{aligned}$$

All the above quantities exist provided the process visits all states in \({\mathbb {S}}\).

However, in most applications fully observed processes are not available and X is only partially observed at specific time points; \(X_p:=\{X(t_i)=x_i\}\), for \(i=1,\ldots ,n\). This problem of the partially observed CTMC has received considerable attention within the literature; recent reviews can be found in Israel et al. (2001), Bladt and Sorensen (2005), and Pfeuffer (2017), for example. The latter of these references also provides an R package implementing various proposals.

It is well known that the maximum likelihood estimator (MLE) approach has several drawbacks; the MLE may not even exist. This existence issue worsens as the gaps between the partially observed records increase; see Bladt and Sorensen (2005). Given all this, an EM algorithm treating the unobserved record lying between the partial observations as latent variables is currently a popular approach, see dos Reis and Smith (2018) and Pfeuffer et al. (2019). On the other hand, a Bayesian treatment of the problem has also been presented; Bladt and Sorensen (2005) proposed a Gibbs sampler based on a rejection sampling algorithm for the unobserved states. A more sophisticated Gibbs sampling approach, which relies on simulation of the CTMC over an interval given the start and end states, was considered in Fearnhead and Sherlock (2006) and implemented in Pfeuffer (2017). Another approach is to use the Gibbs sampler of Rao and Teh (2013), which simulates the CTMC conditioned on possibly noisy observations.

Therefore, current approaches to the inference problem for \(X_p\) prefer the use of a complete likelihood. We believe this has been motivated in part by the desire for a simple-looking complete likelihood, even if it is difficult to obtain; in a Bayesian setting this allows for the use of conjugate priors and avoids parameter tuning when performing Markov chain Monte Carlo (MCMC) methods. However, the problems with using latent variables to sample the missing process between observed states are the high computational cost of doing so and the possibly high correlations which slow the mixing of the sampler. An illustration of this problem in a simple scenario consisting of a binary Markov chain is presented in Appendix B.

There have been some successful applications of Bayesian inference for partially observed CTMCs without the use of latent variables, though all impose some form of restriction compared to our approach. Amoros et al. (2019) consider a CTMC with only two states for modeling the presence or absence of hepatocellular carcinoma. As the transition probabilities for the two states are known exactly, there is no need to impute missing data; so despite the overall setting of a hidden Markov model, the two states make their algorithm simpler. Similarly, Sherlock et al. (2010) consider the two-state case, for which the partial and full likelihoods are available. In Georgoulas et al. (2017) the authors consider a pseudo-marginal approach for population models based on CTMCs with infinitely many states, which are then truncated to a finite state setting. As such, it restricts attention to rate matrices which depend on a small number of parameters, usually fewer than the number of states; by contrast, a full rate matrix has \(d(d-1)\) free parameters, where d is the number of states. Zhao et al. (2016) consider the use of CTMCs for phylogenetic protein modelling. In particular, they propose a generalized-linear-model parameterization of the generator matrix and use data augmentation to complete the likelihood; they too consider constrained rather than full rate matrices, owing to their particular application in phylogenetics. In their Sect. 2.5 they also discuss the use of Metropolis–Hastings (MH) samplers for partially observed CTMCs with the following proposals: (1) additively or multiplicatively perturbing each entry of the generator matrix; or (2) setting an independent Dirichlet proposal for the transition probabilities at each state. In our experiments, we found that in many settings it is better to use a random walk proposal based on a dependent Dirichlet distribution rather than independent ones; indeed, we find a suitable dependent Metropolis proposal based on a parameterization of the rate matrix in terms of probability vectors and scalars, and such a dependent Dirichlet proposal distribution is shown in this paper to work well. As we will see, our new algorithm applies to a fully specified rate matrix.

In the present work we focus on Bayesian inference for the partially observed CTMC without using latent variables; we use the likelihood function directly, by evaluation of matrix exponentials, and perform posterior inference via a Metropolis–Hastings approach where the generator matrices are fully specified and not constrained. The outline is as follows: Sect. 2 presents the theory and the proposed algorithm; Sect. 3 includes simulation and real data studies where we compare with existing Gibbs sampling and EM approaches; and Sect. 4 concludes with a brief discussion.

2 Bayesian procedures

Let a CTMC be partially observed at times \(\{0, t_1,\ldots ,t_n\}\), and denote the observations as \(\{X(0),X(t_1),\ldots ,X(t_n)\}\) where, for short, we write \(X(t_i)=x_i\), and \(\Delta _i=t_{i}-t_{i-1}\), with \(t_0=0\). The probability that the process is in state k at time t after being observed in state j at time 0 is denoted by \(Q_t(j,k)\). Such a transition probability can be expressed in terms of the generator matrix G as follows:

$$\begin{aligned} Q_t(j,k)=[\exp (t\,G)]_{(j,k)}, \end{aligned}$$
(2)

which is the (j,k)th element of the matrix \(\exp (tG)\); written in full as

$$\begin{aligned} \exp (tG)=\sum _{l=0}^\infty \frac{t^l}{l!}\,G^l \end{aligned}$$
(3)

with \(G^0=I\), the \(m\times m\) identity matrix. The likelihood then becomes

$$\begin{aligned} {{{\mathcal {L}}}}(G)=\prod _{i=1}^{n} Q_{\Delta _i}(x_{i-1},x_{i}) = \prod _{i=1}^{n}[\exp (\Delta _i\,G)]_{(x_{i-1},x_{i})}. \end{aligned}$$
(4)

For more on the theory presented here, see, for example, Grimmett and Stirzaker (1982).

Evaluation of the exponential of a matrix is a well-studied topic, see Higham (2005), and routines are readily available in most scientific computing languages; in R, for example, via the package “expm”. So computation of the likelihood is possible and simulation from the corresponding posterior distribution can be performed with a Metropolis–Hastings sampler.
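For illustration, a minimal sketch in Julia of evaluating the logarithm of the likelihood (4); the function name and signature are our own:

```julia
using LinearAlgebra

# Log of likelihood (4): x holds the observed states x_0,...,x_n and
# Δ the inter-observation times Δ_1,...,Δ_n, with G the generator matrix.
function loglik(G::Matrix{Float64}, x::Vector{Int}, Δ::Vector{Float64})
    ll = 0.0
    for i in 1:length(Δ)
        Q = exp(Δ[i] * G)             # transition matrix over a gap Δ_i
        ll += log(Q[x[i], x[i+1]])    # [exp(Δ_i G)]_(x_{i-1}, x_i)
    end
    return ll
end
```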

On the other hand, maximization of the above likelihood is difficult due to the constraints involving the generator matrix G; indeed, direct maximization appears to be a strategy that has not been tried. This issue motivated the search for suitable latent variables, which, once found, provided a source of inference for both classical and Bayesian approaches alike.

In the next two subsections we consider Bayesian approaches; the first is the existing strategy involving latent variables and the second is our proposal which obviates the need for latent variables by working with the likelihood (4) directly. The former is currently the more popular approach as it is the natural Bayesian version of the EM methodology, see for example dos Reis and Smith (2018) and Pfeuffer et al. (2019).

Fig. 1 Observed states x and y separated by a time of length t

2.1 Gibbs sampler and latent variables

In order to utilize the complete likelihood (1) given partially observed data, it is necessary to sample the complete trajectory of the CTMC between the discretely observed times at which we know the states of the CTMC. A straightforward way is to sample the process forward from the start; i.e. between consecutive observed times \((t_j,t_{j+1})\) the plan is to sample the waiting time \(\Delta _1'=t_1'-t_j\) at state \(x_j\), the unobserved transition to a state \(x_1'\), the waiting time \(\Delta _2'=t_2'-t_1'\) at state \(x_1'\), the unobserved transition to state \(x_2'\) and so on; the simulated trajectory is accepted if at time \(t_{j+1}\) the process is at the observed state \(x_{j+1}\). This is how to sample the missing process conditional on G. See Fig. 1 for an illustration, where k states have been sampled in between the known states at times \(t_j\) and \(t_{j+1}\).

This is equivalent to a rejection sampling algorithm; i.e. sample the missing process and accept it if it hits the observed part of the process. With a few states this might work adequately, but it will obviously run into efficiency issues when the number of states becomes large or the time elapsed between discrete observations is large. This rejection sampler was originally used by Bladt and Sorensen (2005) to implement a Gibbs sampler. An alternative, more efficient version is given by Fearnhead and Sherlock (2006), where an algorithm for exact simulation of the CTMC between two known states, \(s_0\) at time \(t_0\) and \(s_e\) at time \(t_e>t_0\), was presented. We note that such a simulation scheme relies on the evaluation of \([\exp ((t_e-t_0)\,G)]_{(s_0,s_e)}\) to determine the number of unobserved transitions k in \((t_0,t_e)\); matrix exponential evaluations are likewise needed to calculate the likelihood (4). The usefulness of such an algorithm for the Gibbs sampling procedure of Bladt and Sorensen (2005) was discussed in Fearnhead and Sherlock (2006) and implemented in the R package of Pfeuffer (2017). In Algorithm 1 we present the corresponding pseudocode for the simulation of CTMC bridges.
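To fix ideas, a minimal sketch in Julia of the naive rejection sampler described above (not the exact conditional simulation of Fearnhead and Sherlock (2006); function and variable names are ours):

```julia
using Distributions

# Naive rejection sampler for a CTMC bridge: simulate the chain forward
# from state x0 for time t using rates λ and jump matrix P, and accept
# only if the chain is in state xe at time t. Inefficient when the state
# space or the gap t is large.
function bridge_rejection(λ::Vector{Float64}, P::Matrix{Float64},
                          x0::Int, xe::Int, t::Float64)
    while true
        s, clock = x0, 0.0
        states, times = [x0], Float64[]
        while true
            w = rand(Exponential(1 / λ[s]))   # waiting time in state s
            clock + w > t && break            # chain stays in s until t
            clock += w
            s = rand(Categorical(P[s, :]))    # next state (p_{s,s} = 0)
            push!(states, s); push!(times, clock)
        end
        s == xe && return states, times       # accept if bridge hits xe
    end
end
```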

Algorithm 1 Simulation of CTMC bridges (pseudocode)
Algorithm 2 Gibbs sampler based on CTMC bridges (pseudocode)

A Bayesian approach using latent variables can be based on sampling the complete trajectory using a Gibbs sampler where we sample iteratively from

$$\begin{aligned} \left[ \{X(t),\, t_i< t <t_{i+1}\} \, |\; G,\,\, X(t_i)=x_i,\,\, X(t_{i+1})=x_{i+1},\Delta _{i+1}=t_{i+1}-t_i \right] , \end{aligned}$$

for \(i=1,\ldots ,n-1\), and then sample

$$\begin{aligned} \left[ G \, | \; \left\{ X(t),\, 0 \le t \le \tau \right\} \right] , \end{aligned}$$

which is based on (1), suitably multiplied by the prior. Pseudocode for the Gibbs sampler is given in Algorithm 2. An advantage of such a Gibbs sampling strategy is that the algorithm is automatic, in the sense that no tuning is required. Another alternative is to use the Gibbs sampler of Rao and Teh (2013), presented in Algorithm 3, where uniformization is used to draw a full CTMC trajectory, conditioned on the partial observations, from a previously drawn full trajectory, and the forward filtering backward sampling algorithm is used to resample the states. In contrast with the previous Gibbs sampler, which requires the evaluation of matrix exponentials and so can be computationally expensive in large dimensions, this algorithm does not require such computations. The corresponding Gibbs sampler is presented in Algorithm 4.
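For concreteness, a minimal sketch in Julia of the conjugate step \([G \,|\, \{X(t),\, 0\le t\le \tau \}]\), assuming independent \(\text {Gamma}(\alpha ,\beta )\) priors on the \(\lambda _j\) and symmetric Dirichlet priors with parameter a on the rows of P (our notational choices):

```julia
using Distributions, LinearAlgebra

# Conjugate Gibbs update of G given a complete trajectory, based on
# likelihood (1): N[j,k] counts transitions j→k and T[j] is the total
# holding time in state j, both computed from the imputed trajectory.
function draw_G(N::Matrix{Int}, T::Vector{Float64}, α, β, a)
    m = length(T)
    # λ_j | · ~ Gamma(α + N_j, β + T_j), parameterized by shape and scale
    λ = [rand(Gamma(α + sum(N[j, :]), 1 / (β + T[j]))) for j in 1:m]
    P = zeros(m, m)
    for j in 1:m
        ks = setdiff(1:m, j)                       # off-diagonal columns
        P[j, ks] = rand(Dirichlet(float.(a .+ N[j, ks])))
    end
    return -Diagonal(λ) + Diagonal(λ) * P          # G = -Λ + ΛP
end
```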

Algorithm 3 Uniformization-based resampling of a CTMC trajectory, Rao and Teh (2013) (pseudocode)
Algorithm 4 Gibbs sampler based on uniformization (pseudocode)

2.2 Metropolis–Hastings sampler

In this subsection we propose a more direct Bayesian approach which does not rely on the introduction of latent variables. We do this by considering a Metropolis–Hastings sampler for the posterior distribution based on the likelihood (4). The Metropolis–Hastings algorithm works by introducing a proposal distribution for the parameters of interest. The proposals for the parameters in the matrices \(\Lambda \) and P are given as follows. Regarding the diagonal matrix \(\Lambda \) with jth diagonal entry denoted by \(\lambda _j\), we take the proposal \(q(\lambda '_j|\lambda _j)\), for each \(j\in \left\{ 1, \ldots , m\right\} \), to be a LogNormal distribution with mean \(\log \lambda _j\) and variance \(\sigma ^2_{j}=\sigma ^2\). For the stochastic matrix P, we consider three possible proposal distributions. Let \(\pmb {p}_j\) denote the vector consisting of the off-diagonal entries of the jth row of P.

The first proposal takes \(q(\pmb {p}_{j}'|\pmb {p}_j)\) to be a Dirichlet distribution with parameter \(c_j \pmb {p}_j\), with \(c_j\in {\mathbb {R}}^+\); for such a Dirichlet proposal the mean of \(\pmb {p}_j'\) is \(\pmb {p}_j\). An alternative choice is to let \(q(\pmb {p}_{j}'|\pmb {p}_j)\) be a Dirichlet distribution with parameter \(c_j \pmb {p}_j+\pmb {1}\), where \(\pmb {1}\) is a vector with all entries equal to 1; such a Dirichlet distribution has mode \(\pmb {p}_j\). In practice we found this latter proposal to be more robust. Finally, we consider a proposal \(q(p_{j,k}'|p_{j,k})\) given by a LogitNormal distribution with mean \(\text {logit}(p_{j,k})\). If a draw from such a proposal is accepted then we have to normalize the vector \(\pmb {p}_j\) by the sum of its entries, so that the matrix P remains a stochastic matrix. This proposal distribution is more suitable when some entries of P are zero or close to zero.
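As an illustration, a minimal sketch in Julia of the second, mode-centred Dirichlet proposal, returning both the draw and the log Hastings correction entering the acceptance probability (function name ours):

```julia
using Distributions

# Dependent Dirichlet proposal for a row p_j of P: p' ~ Dirichlet(c p + 1)
# has mode p. Returns the proposal and log q(p | p') - log q(p' | p).
function propose_row(p::Vector{Float64}, c::Float64)
    p′ = rand(Dirichlet(c .* p .+ 1))
    logq_fwd = logpdf(Dirichlet(c .* p  .+ 1), p′)
    logq_rev = logpdf(Dirichlet(c .* p′ .+ 1), p)
    return p′, logq_rev - logq_fwd
end
```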

Algorithm 5 Metropolis–Hastings sampler (pseudocode)

Each time a proposal is made, we recompute the likelihood function in order to determine whether the proposal is accepted. We illustrate with the update for \(\lambda _1\): we sample \(\lambda _1'\) from \(q(\lambda _1'|\lambda _1)\) and accept this move with probability

$$\begin{aligned} \alpha =\min \left\{ 1,\frac{{{{\mathcal {L}}}}(G')\,\pi (\Lambda ')q(\lambda _1|\lambda _1')}{{{{\mathcal {L}}}}(G)\,\pi (\Lambda )q(\lambda _1'|\lambda _1)}\right\} , \end{aligned}$$

where \(G'=-\Lambda '+ \Lambda 'P\) with \(\Lambda '=\text{ diag }(\lambda _1',\lambda _2,\ldots , \lambda _m)\), while \(G=-\Lambda +\Lambda P\) with \(\Lambda =\text{ diag }(\lambda _1,\lambda _2,\ldots ,\lambda _m)\). A similar procedure follows for the other parameters making up \(\Lambda \) and P; a complete summary of the algorithm is presented in Appendix 1. In Algorithm 5 we give pseudocode for the Metropolis–Hastings sampler. The exponential of the matrix G is calculated using the scaling and squaring method of Higham (2005) as implemented in the Julia LinearAlgebra library https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/. Alternatively, when the number of states becomes prohibitively large we can use the algorithm of Al-Mohy and Higham (2011) by calling the Python implementation in the SciPy library (Virtanen et al. 2020).
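Putting the pieces together, a minimal sketch in Julia of the update for a single \(\lambda _j\), assuming the loglik function sketched earlier; logprior_λ denotes the gamma log-prior density:

```julia
using Distributions, LinearAlgebra

# One Metropolis–Hastings update of λ_j with a LogNormal random walk
# proposal. The proposal is symmetric on the log scale, so the Hastings
# ratio q(λ|λ')/q(λ'|λ) reduces to the Jacobian term λ'/λ.
function update_λ!(Λ::Vector{Float64}, j::Int, P, x, Δ, σ, logprior_λ)
    λ′ = rand(LogNormal(log(Λ[j]), σ))
    Λ′ = copy(Λ); Λ′[j] = λ′
    G′ = -Diagonal(Λ′) + Diagonal(Λ′) * P
    G  = -Diagonal(Λ)  + Diagonal(Λ)  * P
    logα = loglik(G′, x, Δ) - loglik(G, x, Δ) +
           logprior_λ(λ′) - logprior_λ(Λ[j]) +
           log(λ′) - log(Λ[j])                 # log q(λ|λ') - log q(λ'|λ)
    if log(rand()) < logα
        Λ[j] = λ′                              # accept the move
    end
    return Λ
end
```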

2.3 Prior selection

With such a Bayesian framework, the setting of prior distributions is relatively straightforward. For example, we assign independent gamma priors to each \(\lambda _j\), with shape and rate parameters \(\alpha _j,\beta _j \in {\mathbb {R}}^+\). Each row of P can be assigned a Dirichlet prior with parameters chosen so the prior is uniform on the simplex. In previous Bayesian analyses the prior distribution was assigned to the off-diagonal elements of the matrix G, e.g. Bladt and Sorensen (2005), Pfeuffer (2017); the main motivation for this was to assign gamma priors so that there is conjugacy when considering the complete likelihood (1). However, a prior assignment at the level of the matrices \(\Lambda \) and P is useful for eliciting expert information, given the interpretation of the matrices: the jth diagonal element of \(\Lambda \) gives the rate of the exponential time the Markov chain waits in state j, and the jth row of P gives the transition probabilities for the corresponding state changes. In particular, for the simulation studies where we compare Gibbs samplers with the Metropolis–Hastings algorithm, we will use gamma priors on the off-diagonal entries of G with equal shape parameters, so that the induced prior coincides with a Dirichlet prior on the rows of P together with gamma priors on the diagonal entries of \(\Lambda \), as discussed above.
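To make this correspondence explicit, recall the standard gamma–Dirichlet decomposition (stated here in our notation): if the off-diagonal entries \(g_{j,k}\), \(k\ne j\), of the jth row of G are assigned independent \(\text {Gamma}(\alpha ,\beta )\) priors, then

$$\begin{aligned} \lambda _j=\sum _{k\ne j}g_{j,k}\sim \text {Gamma}((m-1)\alpha ,\beta )\quad \text {and}\quad \pmb {p}_j=\left( \frac{g_{j,k}}{\lambda _j}\right) _{k\ne j}\sim \text {Dirichlet}(\alpha ,\ldots ,\alpha ), \end{aligned}$$

with \(\lambda _j\) independent of \(\pmb {p}_j\); hence the two prior specifications coincide.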

3 Illustrations

We programmed the Metropolis–Hastings algorithm and the two Gibbs samplers in the Julia programming language, with the code available at https://github.com/alan7riva/CTMC. Overall, we observe that the Metropolis–Hastings algorithm is not only significantly simpler to code but also faster, and it exhibits better mixing throughout our experiments.

3.1 Two state toy example

Here we consider a two-state problem with generator matrix

$$\begin{aligned} G = \begin{pmatrix} -\lambda _1 & \lambda _1 \\ \lambda _2 & - \lambda _2 \end{pmatrix} \end{aligned}$$

for \(\lambda _1,\lambda _2>0\). In such a case we know \(Q(t) = \exp (tG)\) explicitly and, assuming that we discretely observe a Markov chain associated with this generator matrix G over a time mesh \(\Delta k\), \(\Delta \in {\mathbb {R}}^+\) and \(k \in \{1,\ldots , n\}\) for some \(n\in {\mathbb {N}}\), the likelihood \({\mathcal {L}}(\lambda )\) is also computable; details are given in Appendix B. We drew simulations from the above CTMC with \(\pmb {\lambda }=(\lambda _1,\lambda _2)\in \left\{ (1,1),\,(2,1)\right\} \). The prior distributions were \(\lambda _i\sim \text {Gamma}(2,1)\) for all the experiments, and the variance in the LogNormal proposals for \(\lambda _i\) was set to 0.5. A CTMC was generated until 1000 transitions were obtained and we considered the discrete observations given by times \(\Delta k\) with \(\Delta = 1.0\) and \(1\le k \le \min \left\{ n \, ;\ X(n\Delta )\text { was fully observed}\right\} \). For the simulation studies with rates \(\pmb {\lambda }\) we obtained, respectively, 1014 and 1482 partial observations. In Figs. 2 and 3 we show the posterior fits of \(\lambda _1\) and \(\lambda _2\), respectively, using 50000 MCMC iterations after diagnosing convergence via the potential scale reduction factor of Gelman and Rubin (1992) falling below 1.01, calculated with 4 chains started at random.
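For this two-state generator the matrix exponential has a well-known closed form in terms of \(\rho =\lambda _1+\lambda _2\); a minimal sketch in Julia, which one can check against the generic \(\exp (tG)\):

```julia
using LinearAlgebra

# Closed-form transition matrix for the two-state generator: with
# ρ = λ1 + λ2,
#   Q(t) = (1/ρ) [ λ2 + λ1 e^{-ρt}   λ1 (1 - e^{-ρt}) ;
#                  λ2 (1 - e^{-ρt})  λ1 + λ2 e^{-ρt}  ].
function Q2(λ1, λ2, t)
    ρ = λ1 + λ2
    e = exp(-ρ * t)
    [λ2 + λ1 * e   λ1 * (1 - e);
     λ2 * (1 - e)  λ1 + λ2 * e] ./ ρ
end

G = [-2.0 2.0; 1.0 -1.0]           # λ1 = 2, λ2 = 1
Q2(2.0, 1.0, 0.7) ≈ exp(0.7 * G)   # true
```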

Fig. 2 Marginal histograms of \(\lambda _1\) for Gibbs samplers 1, 2 and Metropolis–Hastings with 5000 iterations after diagnosing convergence, with potential scale reduction, compared with the true posterior distribution. True \(\pmb {\lambda }=(1,1)\)

Fig. 3 Marginal histograms of \(\lambda _2\) for Gibbs samplers 1, 2 and Metropolis–Hastings with 5000 iterations after diagnosing convergence, with potential scale reduction, compared with the true posterior distribution. True \(\pmb {\lambda }=(2,1)\)

In Tables 1 and 2 we show the computation time, effective sample sizes (ESS) and ESS per second of computation before diagnosing convergence, all averaged over the 4 chains started at random values; the same quantities are also presented for a run of 50000 iterations started at the final value of one of the 4 pilot runs used to assess convergence, this run being chosen uniformly at random.

We observe that the Metropolis–Hastings sampler is significantly faster than the Gibbs samplers, resulting in a larger ESS per second. In our experiments Gibbs sampler 1 outperforms Gibbs sampler 2, the latter being considerably slower. For this reason, in the following studies we focus our comparison only on Gibbs sampler 1.

Table 1 Gibbs and Metropolis–Hastings comparison for toy simulation study with \(\pmb {\lambda }=(1,1)\)
Table 2 Gibbs and Metropolis–Hastings comparison for toy simulation study with \(\pmb {\lambda }=(2,1)\)

3.2 Five state simulation example

In this study, we draw observations from a CTMC with state space \({\mathbb {S}}=\left\{ 1,2,3,4,5 \right\} \). In all the experiments presented in this section, the stochastic matrix \(P=(p_{j,k})\) associated with the Markov chain is chosen such that \(p_{j,k}=0.25\) for \(j\ne k\). On the other hand, we take \(\Lambda =\text{ diag }(\lambda , \lambda ,\lambda ,\lambda ,\lambda )\) for simulations using \(\lambda \in \left\{ 0.25,0.5, 1,2,4\right\} \). Let n be the total number of observations, for which we take the inter-arrival times \(\Delta _i = \Delta \) for \(i\in \left\{ 1,\ldots , n \right\} \). For different values of \(\lambda \) and \(\Delta \) we choose the number of observations \(n(\lambda ,\Delta )=5000 \lambda \Delta \), so we have about \({\mathbb {E}}\sum _{i=1}^n T_i/\Delta = 5000 \) observations in each study. The prior distributions were \(\lambda _i\sim \text {Exp}(1)\), \(i\in \left\{ 1,\ldots , 5\right\} \), and \(p_{j,k}\sim \text {Gamma}(0.25,1)\), \(j\in \left\{ 1,\ldots , 5\right\} \), \(k\in \left\{ 1,\ldots , 4\right\} \), pairwise independently for all the experiments. The variance in the LogNormal proposals for \(\lambda _i\) was set to 0.5, and for the proposals of \(\pmb {p}_j\) we used a Dirichlet with common parameter 0.5. We ran the Gibbs and Metropolis–Hastings samplers, i.e. Algorithms 2 and 5 respectively, with 10000 iterations for each experiment. In Fig. 4 we present the mean generators obtained from each sampler after removing 3000 burn-in iterations with \(\Delta =1\); respective figures for \(\Delta \in \left\{ 0.25,0.5,2,4\right\} \) are presented in the supplementary material. We observe that when the scale \(1/\lambda \) is equal to the observation window \(\Delta \), both the Gibbs sampler 1 and Metropolis–Hastings chains produce a mean generator which fits the true values well, with the Metropolis–Hastings sampler being faster, thus allowing for better estimation if desired.

Table 3 Frobenius distance between estimated posterior mean and true generator matrix across 5 by 5 simulation studies

When the scale of the waiting times for the true process to change state is smaller than the observation window \(\Delta \), one could expect the Gibbs sampler, which uses the unobserved transitions as auxiliary variables, to perform better than the Metropolis–Hastings sampler. However, we observe that this is not the case; for instance, when \(\Delta =0.5\) with scale 0.125 the Frobenius distance between the true generator matrix and the mean generator associated with the Metropolis–Hastings sampler is 5.79, while for the Gibbs sampler we get a distance of 6.63. Table 3 shows, for each experiment, the computation time and Frobenius distance to the true generator matrix for the Metropolis–Hastings and Gibbs samplers as well as for the EM algorithm, which is commonly used in credit risk applications (see for instance Pfeuffer (2017)). All starting points are the same for the three methods. Another advantage of the Metropolis–Hastings algorithm showcased by these experiments was again the improvement in computation times, as well as their stability across experiments, in comparison to the Gibbs sampler, which, depending on the observation window and the true scale, could be slower due to the bridge simulations. Similar behavior was observed for Gibbs sampler 2. Table 4 shows the computation times.
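For clarity, the Frobenius distance referred to throughout is \(\Vert {\widehat{G}}-G\Vert _F=\big (\sum _{j,k}({\widehat{g}}_{j,k}-g_{j,k})^2\big )^{1/2}\); in Julia this is simply:

```julia
using LinearAlgebra

frob(Ĝ, G) = norm(Ĝ - G)   # norm on matrices defaults to the Frobenius norm
```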

Table 4 Computation time for 5 by 5 simulation studies
Fig. 4 Comparison of estimated generator matrices for Metropolis–Hastings and Gibbs chains for \(\Delta = 1\) with \(\lambda = 4\) (first row), \(\lambda =2\) (second row) and \(\lambda = 1\) (third row)

3.3 Credit risk analysis

In Pfeuffer (2017) a package for analyzing continuous-time Markov chain models with partially observed data is presented. In particular, it implements the Gibbs sampler of Bladt and Sorensen (2005), with the default setting being the exact simulation of the Markov chain over an interval given the start and end states, as presented and discussed in Fearnhead and Sherlock (2006), rather than the acceptance–rejection sampling algorithm of Bladt and Sorensen (2005). The foremost application of the package is in credit risk, where the Markov chain states \(\left\{ \text {AAA}, \text {AA}, \text {A}, \text {BBB}, \text {BB}, \text {B}, \text {C}, \text {D} \right\} \) correspond to credit ratings. In Fig. 5 we show a comparison of the new Metropolis–Hastings algorithm, using the LogitNormal proposal for P, with the Gibbs sampler for the credit risk application presented in Pfeuffer (2017). The priors were chosen so that they coincide for both samplers: here we take the prior for each \(\lambda _j\) as \(\text{ Gamma }(7,5)\) and the prior for each \(\pmb {p}_j\) as Dirichlet with common parameter 1. The variance in the LogNormal and LogitNormal proposals is taken to be 0.5. We observe that the fitted values for the mean generator matrices, obtained from a run of length 100000 of which the first 10000 iterations were discarded as burn-in, are close to the Gibbs mean values; the Frobenius distance between the matrices is less than 0.11. Hence, we have shown in a not too demanding scenario that the use of latent variables is not necessary and that equal performance can be achieved using the observed likelihood.

Fig. 5 Estimated generator matrices for Metropolis–Hastings and Gibbs chains for the credit risk data of Pfeuffer (2017)

4 Discussion

Bayesian inference for a partially observed CTMC without the use of latent variables has not, to the best of our knowledge, been considered before. We argue that the use of latent variables in this problem is unnecessary and leads to more complex algorithms that are difficult to implement. On the other hand, the Metropolis–Hastings MCMC sampler is both simpler to use and faster to converge, as showcased in Sect. 3.1. The Metropolis–Hastings approach is computationally simple yet efficient in comparison with Gibbs samplers based on imputing missing latent observations in the form of CTMC bridges between the partial observations. Both algorithms accurately target the posterior distributions; however, high correlations in the bridge sampling steps cause slow mixing and small effective sample sizes per second in this particular setting. The only complicated aspect of our algorithm is the computation of the exponential of a matrix, for which adequate software is now available. Our approach is general in that it allows for the inference of fully specified generator matrices, i.e. with no constraints beyond negative values on the diagonal, non-negative values off the diagonal and zero row sums. Two proposal distributions were put forward: one for generator matrices with non-zero off-diagonal entries and one allowing for the possibility of such zero entries.