
5.1 Introduction

In many areas of science and engineering, we are interested in modelling uncertainty about the behaviour of dynamical systems, that is, systems whose state changes as time passes. For instance, we may want to model the evolution of the spatial trajectories of a system in motion; or the performance and reliability of a complex composite system as its components wear out, break down and get replaced; or the spread of pathogens through a population; or the evolution of stock prices—and so on and so forth.

What all these systems have in common is that there is a dynamic component to their description—they change over time—and they are, in a sense, hard to describe exactly. For instance, this difficulty may arise because their behaviour depends on unknown external influences or because the system cannot reasonably be described at a sufficiently detailed level. Thus, there arises an uncertainty about how exactly the system will evolve over time, even if one can model how it will ‘roughly’ behave. Regardless of the interpretation that we want to assign to this uncertainty, such systems are modelled using stochastic processes. A stochastic process, then, is a probabilistic description of the system under study. In this sense, it provides a formal and integrated description of the system dynamics and the probabilistic uncertainty of its evolution.

On the other hand, we might also be uncertain about whether such a model is ‘correct’. For instance, we might not know exactly the numerical values that the parameters of our model should take. Similarly, we might be aware that our modelling assumptions lead to simplifications that are not necessarily warranted, which introduces uncertainty about the accuracy or applicability of any assessments made on the basis of these models. It is therefore of interest to robustify our models also against these kinds of ‘meta’, or ‘higher-order’, uncertainties.

In this chapter, we consider stochastic processes for which this higher-order uncertainty is modelled using the theory of imprecise probabilities (IP). For an extended introduction to IP, we refer the reader back to Chap. 2. We constrain ourselves to briefly recalling that such imprecise probabilistic models can be interpreted as representing a set of traditional probabilistic models. So, in our current setting, we will be considering sets of stochastic processes. From an inference point of view, the aim is then to compute inferences which are robust with respect to variations within such a set. We recall from Chap. 2 that these robust inferences are captured in general by the lower and upper expectations with respect to the elements of the set that we are considering.

Our aim with the present chapter is to provide an extensive but intuitive introduction to the theory of imprecise stochastic processes and of imprecise Markov chains in particular. To this end, we will intentionally focus on the different representations of these processes. We will show how each of the different ways of looking at these models provides its own way of deriving useful properties and highlights different intuitive ways of reasoning about them. Important results and properties are stated, but we have made an effort to keep the discussion intuitive. We try to avoid technicalities and do not provide extended proofs; instead, we will provide pointers to the literature that the interested reader might pursue herself.

The remainder of this chapter is organised as follows. We start the discussion by giving a quick introduction to stochastic processes in Sect. 5.2. The first part basically uses the measure-theoretic approach (albeit in a rather simplified sense) to pin down some first concepts and notation. We then go on to present three different graphical representations of stochastic processes, which can be used when the time-dimension is discrete. Specifically, we cover the representation using probability trees, in Sect. 5.2.1; using Bayesian networks, in Sect. 5.2.2; and using transition graphs, in Sect. 5.2.3.

Once we have developed these different ways of reasoning about discrete-time processes, we generalise the discussion to imprecise discrete-time processes in Sect. 5.3. We use the previously developed graphical notions to provide intuition about how to reason and compute inferences using these models. The treatment of (imprecise) continuous-time processes is largely postponed until Sect. 5.4. Here the graphical and intuitive representations largely break down, but we can then use the previously developed understanding of the discrete-time case to reason about these models. To keep the main text as readable as possible, the discussion of the literature on which the material in this chapter is based is deferred to Sect. 5.5.

5.2 (Precise) Stochastic Processes

We will start the exposition around stochastic processes in a relatively general and abstract sense but will quickly make things more specific. Throughout the remainder of this chapter, we will consider some fixed abstract state-space \(\mathcal {X}\). A state is an element \(x\in \mathcal {X}\) and represents uniquely the relevant information about the underlying system that we are interested in modelling. So as not to complicate matters, we will assume throughout that \(\mathcal {X}\) is finite, so that we can identify it without loss of generality as the set \(\mathcal {X}=\{1,\ldots ,k\}\subset \mathbb {N}\). Note that here and in what follows, we denote with \(\mathbb {N}\) the natural numbers and will write \(\mathbb {N}_{0}:\!=\mathbb {N}\cup \{0\}\) when we include zero. Furthermore, the real numbers are written \(\mathbb {R}\), the non-negative reals are \(\mathbb {R}_{\geq 0}\) and the positive reals are \(\mathbb {R}_{>0}\).

Because we are interested in modelling a system whose state \(x\in \mathcal {X}\) changes over time, we next identify some time-dimension \(\mathbb {T}\). A crucial choice to be made later on is whether we are considering processes in discrete-time, in which case we identify \(\mathbb {T}=\mathbb {N}_{0}\), or processes in continuous-time, in which case \(\mathbb {T}=\mathbb {R}_{\geq 0}\). For now we simply keep the discussion general without making this identification.

With the state-space and time-dimension in place, it now makes sense to talk about the realisation of some (yet to be identified) stochastic process. Such a realisation is also called a sample path, and it is a function \(\omega :\mathbb {T}\to \mathcal {X}\). So, this ω describes for each point in time \(t\in \mathbb {T}\) the state \(\omega (t)\in \mathcal {X}\) that the system was in at that time. We collect in the set Ω all these sample paths. For technical reasons, it is sometimes required to restrict attention to paths that satisfy sufficient smoothness conditions; for instance, when \(\mathbb {T}=\mathbb {R}_{\geq 0}\), it is common practice to let Ω only contain càdlàg functions, that is, paths ω(t) that are right-continuous and whose left-sided limits exist everywhere.

This set Ω thus contains all possible ways in which the system might behave over time; it can therefore be considered an outcome space of a stochastic model. Formally, we will consider some abstract underlying probability space \((\varOmega ,\mathcal {F},P)\), where \(\mathcal {F}\) is some appropriate σ-algebra on Ω and where P is a probability measure on \((\varOmega ,\mathcal {F})\). Given this probability space, we can finally formalise the notion of a stochastic process as a collection \(\{X_t\}_{t\in \mathbb {T}}\) of random variables associated to this probability space. We will here slightly restrict our definition to the following specific stochastic process:

Definition 5.1 (Stochastic process)

Fix a time-dimension \(\mathbb {T}\) and consider a probability space \((\varOmega ,\mathcal {F},P)\). Then (the corresponding) stochastic process is the collection \(\{X_t\}_{t\in \mathbb {T}}\) of random variables \(X_t:\varOmega \to \mathcal {X}:\omega \mapsto \omega (t)\), \(t\in \mathbb {T}\), on this space.

Corollary 5.1

Fix a time-dimension \(\mathbb {T}\) ; consider a probability space \((\varOmega ,\mathcal {F},P)\) ; and let \(\{X_t\}_{t\in \mathbb {T}}\) be the corresponding stochastic process. Then for all \(t\in \mathbb {T}\) and \(x\in \mathcal {X}\) , it holds that \(\Pr (X_t=x) = P\bigl ( \{\omega \in \varOmega \,:\,\omega (t)=x\} \bigr )\).

Proof

Fix \(t\in \mathbb {T}\), and recall the definition of a random variable: for all \(x\in \mathcal {X}\), the probability \(\Pr (X_t=x)\) of \(X_t\) taking the value x is equal to \(P\bigl (X_t^{-1}(x)\bigr )\), the measure of its preimage in Ω. Since \(X_t(\omega ) = \omega (t)\), we have \(X_t^{-1}(x)=\{\omega \in \varOmega \,:\,\omega (t)=x\}\). □

The above is a formal way of saying that, and how, these random variables \(\{X_t\}_{t\in \mathbb {T}}\) are associated to the given probability space. In words, for some fixed time \(t\in \mathbb {T}\), \(X_t\) is a random variable that takes on a value \(x\in \mathcal {X}\) with probability equal to the measure of the set of paths along which the state at time t is x. Conversely, if we fix the outcome ω ∈ Ω, then the collection \(\{X_t\}_{t\in \mathbb {T}}\) can be considered a deterministic process, and \(X_t(\omega ) = \omega (t)\) for all \(t\in \mathbb {T}\).

Note, therefore, that all the quantitative information about the probability of the process taking on certain values at given points in time is completely determined by the measure P. It is therefore also intuitive to instead consider this measure P to be ‘the stochastic process’, although this is technically an abuse of terminology. This is because, for a given probability space \((\varOmega ,\mathcal {F},P)\), it is possible to define many different stochastic processes; any \(\mathbb {T}\)-indexed collection of random variables on this space satisfies the general definition. However, in a sense, the stochastic process in Definition 5.1 can be viewed as the ‘canonical’ stochastic process corresponding to the given probability space, since it specifically and exactly represents the uncertainty about which states might be obtained at different points in time. We will therefore, and for notational convenience, often refer to the measure P and its corresponding stochastic process \(\{X_t\}_{t\in \mathbb {T}}\) interchangeably and without confusion.

Next, it will be convenient to have a standardised notation to index a subset of the random variables of a stochastic process. To this end, for any finite sequence of time points \(\mathbf {t} = t_1,\ldots ,t_n\) in \(\mathbb {T}\), with \(n\in \mathbb {N}\), we will write \(X_{\mathbf {t}}=X_{t_1},\ldots ,X_{t_n}\). Typically, these sequences will be taken to be ordered, so that \(t_1 < \cdots < t_n\). Note that each of the random variables \(X_{t_i}\), \(i = 1,\ldots ,n\), takes values in \(\mathcal {X}\). Hence, the sequence \(X_{\mathbf {t}}\) takes values (jointly) in \(\mathcal {X}^n=\times _{i=1}^n\mathcal {X}\). An element of this joint state-space is thus a vector \((x_1,\ldots ,x_n)\in \mathcal {X}^n\). When we are explicitly talking about a sequence \(\mathbf {t}\) of n time points, we will also write \(x_{\mathbf {t}}\) to denote a generic element of \(\mathcal {X}^n\).

In what follows, we will be interested in computing the expectation of some real-valued function, whose value depends on the specific realisation of the stochastic process. To prevent technical difficulties, we will assume that this function only depends on a finite number of time points; without loss of generality, we can then assume that it is a map \(f:\mathcal {X}^n\to \mathbb {R}\), with \(n\in \mathbb {N}\), whose value depends on the n random variables \(X_{\mathbf {t}}\), with \(\mathbf {t} = t_1,\ldots ,t_n\) in \(\mathbb {T}\). We collect in the set \(\mathcal {L}(\mathcal {X}^n)\) all such real-valued functions on \(\mathcal {X}^n\). The expected value of any such \(f\in \mathcal {L}(\mathcal {X}^n)\) on the n time points t is defined as

$$\displaystyle \begin{aligned} \mathbb{E}_P\bigl[f(X_{\mathbf{t}})\bigr] :\!= \sum_{x_{\mathbf{t}}\in\mathcal{X}^n}f(x_{\mathbf{t}}) P(X_{\mathbf{t}}=x_{\mathbf{t}})\,, \end{aligned} $$
(5.1)

where we have implicitly introduced the intuitive notation for the set

$$\displaystyle \begin{aligned} (X_{\mathbf{t}}=x_{\mathbf{t}}) :\!= \Big\{\omega\in\varOmega\,:\,\big(\forall i\in\{1,\ldots,n\}:\omega(t_i)=x_{t_i}\big)\Big\}\,. \end{aligned}$$

In Eq. (5.1), we use the subscript P for the expectation operator \(\mathbb {E}_P\) to make explicit that it is taken with respect to the measure P; this will be notationally convenient further on.
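
To make this concrete, the following minimal Python sketch evaluates Eq. (5.1) by brute-force enumeration of \(\mathcal {X}^n\); the state labels, the joint probabilities and the function f are hypothetical toy inputs, not part of the chapter's formalism.

```python
import itertools

def expectation(f, joint, states, n):
    """Eq. (5.1): the sum of f(x_t) * P(X_t = x_t) over all x_t in X^n."""
    return sum(f(x) * joint[x] for x in itertools.product(states, repeat=n))

# hypothetical joint pmf of (X_{t_1}, X_{t_2}) on the binary state-space {0, 1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

f = lambda x: sum(x)  # counts the time points at which the state is 1
print(expectation(f, joint, [0, 1], n=2))  # 0.1 + 0.2 + 2 * 0.3 = 0.9
```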

We finish this first introduction by recalling the notion of conditional probabilities and conditional expectations. For any two finite sequences of time points \(\mathbf {t}\) and \(\mathbf {s}\) in \(\mathbb {T}\), the conditional probability of \(X_{\mathbf {t}}\), given \(X_{\mathbf {s}}\), is derived using Bayes’ rule:

$$\displaystyle \begin{aligned} P(X_{\mathbf{t}}\,\vert\,X_{\mathbf{s}}) :\!= \frac{P(X_{\mathbf{s}},X_{\mathbf{t}})}{P(X_{\mathbf{s}})}\,, \end{aligned}$$

whenever \(P(X_{\mathbf {s}})\) is strictly positive. The necessity of this condition is obvious: without it, Bayes’ rule would require a division by zero.

Using this notion of conditional probability, we can define conditional expectations analogously. Suppose the sequences \(\mathbf {s}\) and \(\mathbf {t}\) are of length \(n,m\in \mathbb {N}\), respectively. Then for any \(f\in \mathcal {L}(\mathcal {X}^{n+m})\) on \(X_{\mathbf {s}},X_{\mathbf {t}}\) we define, for all \(x_{\mathbf {s}}\in \mathcal {X}^n\),

$$\displaystyle \begin{aligned} \mathbb{E}_P\bigl[f(X_{\mathbf{s}},X_{\mathbf{t}})\,\big\vert\,X_{\mathbf{s}}=x_{\mathbf{s}}\bigr] :\!= \sum_{x_{\mathbf{t}}\in\mathcal{X}^m} f(x_{\mathbf{s}},x_{\mathbf{t}})P(X_{\mathbf{t}}=x_{\mathbf{t}}\,\vert\,X_{\mathbf{s}}=x_{\mathbf{s}})\,. \end{aligned}$$

5.2.1 Probability Trees

The preceding discussion introduced stochastic processes in a very general, but rather abstract sense. We will build further intuition by next offering a different view and representation, by means of probability trees. In the remainder of this section, unless otherwise specified, we will focus on discrete-time stochastic processes, whence we identify \(\mathbb {T}=\mathbb {N}_{0}\).

We next need some notation and definitions for ‘partial paths’, which in this setting are also called situations. As before, a (full) path is a map \(\omega :\mathbb {N}_{0}\to \mathcal {X}\). In contrast, a situation is defined as a (finite length) prefix of such a path. In other words, a situation is an element of a set \(\mathcal {X}^{n}\), for some \(n\in \mathbb {N}\). If \(w\in \mathcal {X}^n\), \(n\in \mathbb {N}\), is a situation, we write \(w_i\) for its (i + 1)-th coordinate, i ∈{0, …, n − 1}, and we say that its length is \(\lvert w\rvert =n\). Note that the indexing over the coordinates is taken to start from zero rather than one—this is done for notational consistency with paths ω. Since we will need to refer to it so often, we introduce the shorthand notation \(w_\dagger \) for the last element of w; so if w has length n, then \(w_\dagger :\!= w_{n-1}\). The set of all non-empty situations is \(\mathcal {X}^*:\!= \cup _{n\in \mathbb {N}}\mathcal {X}^n\), and we define \(\mathcal {X}^\Diamond :\!= \mathcal {X}^*\cup \{\Box \}\), where we add the empty situation denoted by □.

As a final point in this notational digression, for any \(s,t\in \mathbb {N}_{0}\) such that s ≤ t, we will introduce the shorthand notation s : t to denote the sequence of time points s, …, t. Using our previously introduced notation, we can then write \(X_{s:t}\) for the random variables at these time points. Furthermore, for any \(n\in \mathbb {N}_{0}\) and any situation \(w\in \mathcal {X}^{n+1}\), we can then use the previously introduced notation to write \(X_{0:n} = w\); this is understood to mean that the random variables at time points 0, …, n obtained the states corresponding to the situation w.

We endow the set \(\mathcal {X}^\Diamond \) with the prefix order, denoted ≺, which is a partial order such that □ ≺ v for all \(v\in \mathcal {X}^*\) and, for all \(v,w\in \mathcal {X}^*\) with lengths \(n=\lvert v\rvert \) and \(m=\lvert w\rvert \), it holds that v ≺ w if and only if n < m and \(v_i = w_i\) for all i ∈{0, …, n − 1}. This is just a rigorous but somewhat obfuscated way of saying that v ≺ w if ‘v is the beginning of w’ or ‘w is what you can get if v happens first, and then some other things happen’ or, indeed, ‘v is a prefix of w’.

The important thing to notice is that the ordered set \((\mathcal {X}^\Diamond ,\prec )\) induces a graphical tree structure, with all the situations as its vertices. This tree is what is known as the event tree. It has □ as its root, and, for all \(v,w\in \mathcal {X}^\Diamond \), w is a descendant of v exactly if v ≺ w. An example of such a tree is shown in Fig. 5.1, which (partially) shows the event tree corresponding to a binary state-space \(\mathcal {X}=\{a,b\}\).

Fig. 5.1

A (partial) event tree for a binary state-space \(\mathcal {X}=\{a,b\}\). The vertices are situations, i.e. elements of \(\mathcal {X}^\Diamond \), and the edges are induced by the prefix order ≺. Dashed lines represent branches that are not shown in the figure. The tree has been augmented to a probability tree, by assigning to each \(w\in \mathcal {X}^*\) a local model \(p_w\). A time axis represents at which point in time the situations can occur

Such an event tree can be turned into an intuitive representation of a stochastic process by augmenting it into a probability tree. This is done by assigning to each situation \(w\in \mathcal {X}^\Diamond \) in the tree a local model \(p_w\), which is a probability mass function on \(\mathcal {X}\); that is, it is a map \(p_w:\mathcal {X}\to \mathbb {R}_{\geq 0}\) such that \(\sum _{x\in \mathcal {X}}p_w(x)=1\). An example of this is again illustrated in Fig. 5.1.

Definition 5.2 (Probability tree)

A probability tree is a tuple \((\mathcal {X}^\Diamond ,\prec ,p_{(\cdot )})\), where \(\mathcal {X}^\Diamond \) is the set of all situations, ≺ is the prefix order on \(\mathcal {X}^\Diamond \) and \(p_{(\cdot )}\) represents all local models, so that \(\sum _{x\in \mathcal {X}}p_w(x)=1\) for all \(w\in \mathcal {X}^\Diamond \).

The mechanism by which a stochastic process obtains a certain realisation ω ∈ Ω can now be interpreted as performing a weighted, random walk along this probability tree, starting from the empty situation □. Following the tree in Fig. 5.1, this is done as follows: from □, we transition either to a, with probability \(p_\Box (a)\), or to b, with probability \(p_\Box (b)\). Suppose we transition to a. From this new situation, the next step will take us either to aa, with probability \(p_a(a)\), or to ab, with probability \(p_a(b)\). Proceeding in this fashion, an infinite random walk along this tree generates a full path \(\omega :\mathbb {N}_{0}\to \mathcal {X}\), where, for all \(t\in \mathbb {N}_{0}\), the state ω(t) represents the (randomly chosen) branch that we took along the tree at the (t + 1)-th step.
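
This random-walk mechanism is straightforward to simulate. The sketch below is a minimal illustration; representing situations as tuples and local models as dicts returned by a function local_model(w) is our own hypothetical convention, not notation from the chapter.

```python
import random

def sample_path(local_model, states, n_steps, seed=None):
    """Random walk along a probability tree: start in the empty situation
    and repeatedly draw the next state from the current local model."""
    rng = random.Random(seed)
    w = ()  # the empty situation
    for _ in range(n_steps):
        p_w = local_model(w)  # probability mass function in situation w
        w += (rng.choices(states, weights=[p_w[x] for x in states])[0],)
    return w

# Hypothetical example: a homogeneous Markov chain on {'a', 'b'}, whose
# local model only depends on the last state (cf. Definition 5.3).
def local_model(w):
    if w and w[-1] == 'a':
        return {'a': 0.9, 'b': 0.1}
    return {'a': 0.5, 'b': 0.5}  # after b, and in the empty situation

print(sample_path(local_model, ['a', 'b'], n_steps=10, seed=1))
```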

This ‘path construction’ view allows us also to connect back to the measure-theoretic definition that we encountered earlier. To obtain this correspondence in one direction, fix a probability tree and let \((\varOmega ,\mathcal {F})\) be an appropriate measurable space of discrete-time sample paths, on which we will aim to construct the measure P quantifying, in the measure-theoretic sense, the uncertainty of the corresponding stochastic process \(\{X_t\}_{t\in \mathbb {N}_{0}}\) on the resulting probability space.

We now reason intuitively by using the ‘random walk’ along the probability tree. Starting from □, we transition to a first situation \(x\in \mathcal {X}\) with probability \(p_\Box (x)\). From there, we could then perform the entire infinite random walk to generate the remainder of the path. So, a different way of saying this is that, of all the random paths ω ∈ Ω that could be generated, a fraction of \(p_\Box (x)\) of them will start with ω(0) = x. Using also the interpretation given by Corollary 5.1, it therefore makes sense to define the first-step marginal measure \(P(X_0=x) :\!= p_\Box (x)\) for all \(x\in \mathcal {X}\).

Let us now consider the next step, and assume the first step down the tree resulted in a situation \(x\in \mathcal {X}\). Then, with probability \(p_x(y)\), \(y\in \mathcal {X}\), the next situation will be xy. In terms of paths that could be generated, a fraction of \(p_x(y)\) of the paths that satisfy ω(0) = x will furthermore satisfy ω(1) = y. Therefore, we define the second-step marginal measure as \(P(X_0=x,X_1=y) :\!= p_\Box (x)p_x(y)\), for all \(x,y\in \mathcal {X}\).

Proceeding in this manner, for every situation \(w\in \mathcal {X}^*\) with length n + 1, \(n\in \mathbb {N}_{0}\), we can compute the (n + 1)-th step marginal measure as

$$\displaystyle \begin{aligned} P(X_{0:n}=w) :\!= p_\Box(w_0)\prod_{i=1}^{n} p_{w_{0:(i-1)}}(w_i)\,, \end{aligned}$$

where \(w_{0:(i-1)}\) denotes the length-i prefix of w; or in words, by multiplying all probabilities given by the local models of the situations encountered on the path from the root of the tree, down to the situation w.

A fundamental result in the measure-theoretic treatment of stochastic processes (known as the Kolmogorov extension theorem) states that the collection of all these n-th step marginal measures induces (‘coherently’) a probability measure P on \((\varOmega ,\mathcal {F})\). Specifically, the finite n-th step marginals of P will correspond exactly to these n-th step marginal measures that we constructed from the probability tree. This establishes the connection between probability trees and discrete-time measure-theoretic stochastic processes, in that the latter can be constructed from the former.
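
In code, the n-th step marginal of a situation w is just the product of the local probabilities along its branch. A small sketch, reusing the hypothetical local_model interface from the previous example:

```python
def marginal(local_model, w):
    """P(X_{0:n} = w): the product of the local probabilities of the
    situations encountered on the branch from the root down to w."""
    prob = 1.0
    for i, x in enumerate(w):
        prob *= local_model(w[:i])[x]  # probability of step i, given the prefix w[:i]
    return prob

# with local_model as in the previous sketch:
print(marginal(local_model, ('a', 'a', 'b')))  # 0.5 * 0.9 * 0.1 = 0.045
```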

For the other direction, so, to construct a probability tree from a given probability space \((\varOmega ,\mathcal {F},P)\), we start with an event tree and aim to construct the local models \(p_{(\cdot )}\). Using the intuitive interpretation offered by Corollary 5.1, we start by setting \(p_\Box (x) :\!= P(X_0=x)\) for all \(x\in \mathcal {X}\). For all other situations \(w\in \mathcal {X}^*\) with length n + 1, \(n\in \mathbb {N}_{0}\), the local model \(p_w\) is defined as the conditional measure constructed from Bayes’ rule, i.e. for all \(x\in \mathcal {X}\),

$$\displaystyle \begin{aligned} p_w(x) &= P\bigl( X_{n+1}=x\,\big\vert\, X_{0:n}=w \bigr) = \frac{P\bigl(X_{0:n}=w, X_{n+1}=x\bigr)}{P\bigl( X_{0:n}=w \bigr)}\,. \end{aligned} $$
(5.2)

This also establishes the connection in the other direction. It can be verified that, by now constructing from this probability tree a measure \(P'\), say, in the manner described above, we obtain again \(P' = P\); so, we conclude that this yields a one-to-one correspondence between probability trees and measure-theoretic stochastic processes.

It should be noted that the second direction in the preceding discussion has one (rather large) caveat: it does not work when there are partial paths that have zero probability of occurring. This is because Bayes’ rule then cannot define the conditional measure required to construct the local model for the situation corresponding to that partial path, since it would result in a division by zero.

To summarise, we can conclude that there is indeed a correspondence between the two representations that we have seen so far (up to some technical difficulties surrounding probabilities that are zero). We have seen that the graphical tree structure allows us to reason intuitively about how a stochastic process generates a sample path, by ‘walking’ from the root of the tree down its branches. As we will discuss next, we can also use this structure to ‘reason backwards’: from vertices deep down in the tree back to the root. We will see that this allows one to intuitively derive computational methods for working with stochastic processes.

So, fix \(n\in \mathbb {N}_{0}\), and let \(f\in \mathcal {L}(\mathcal {X}^{n+1})\) be a real-valued function for which we aim to compute the expected value with respect to the random variables \(X_{0:n}\) at the time points \(0,\ldots ,n\in \mathbb {N}_{0}\). Note that it suffices to consider this case, in the sense that any function defined on a subset of the variables \(X_{0:n}\) can always be trivially extended to a function on all of them. Now first notice the following. For any situation \(w\in \mathcal {X}^*\) with length \(\lvert w\rvert = n+1\), the value of f in w is easy to compute; it is simply f(w). Hence in particular, the expected value of f, in w, is simply

$$\displaystyle \begin{aligned} \mathbb{E}\bigl[f(X_{0:n})\,\big\vert\,X_{0:n}=w\bigr] = f(w)\,. \end{aligned}$$

Recall that the situation w represents a node in the event tree. We will now ‘pull back’ the above expected value, to the time point n − 1. Consider therefore the parent situation of w in the probability tree; we will compute the expected value of f in this parent situation.

This parent is a situation v of length \(\lvert v\rvert =\lvert w\rvert -1=n\), which entirely coincides with w: \(v_i = w_i\) for all i = 0, …, n − 1. Associated to v is the local probability model \(p_v\) which, as we have discussed above, represents the probability with which a random walk along the tree travels through the various children of v. In particular, such a random walk goes through the situation w, with probability \(p_v(w_\dagger )\). Therefore, the contribution of the expected value in w, to the expected value in v, is the expected value in w weighted by \(p_v(w_\dagger )\). Since this holds for all children of v, we can write

$$\displaystyle \begin{aligned} &\mathbb{E}\bigl[f(X_{0:n}) \,\big\vert\, X_{0:(n-1)}=v \bigr] = \sum_{x\in\mathcal{X}} p_v(x)\mathbb{E}\bigl[f(X_{0:n})\,\big\vert\,X_{0:(n-1)}=v,X_n=x\bigr]\,. \end{aligned} $$

This ‘pullback’ operation is graphically illustrated in Fig. 5.2.

Fig. 5.2

Graphical illustration of ‘pulling back’ the expected value of a function f on \(X_{0:2}\), in a probability tree on a binary state-space \(\mathcal {X}=\{a,b\}\). Top: the function f is entirely determined by the situations of length 3, i.e. the expected value of the function in those situations is simply the value of the function evaluated in that situation. Bottom: the result after ‘pulling back’ the expectations by one step. The resulting conditional expectation is a function whose value is entirely determined by the situations of length 2. The values are the weighted average of the expectations in the child nodes, weighted by the local models \(p_{(\cdot )}\)

Now, observe that the above conditional expectation of f in v is itself a real-valued function in \(\mathcal {L}(\mathcal {X}^n)\). Its value is determined by the states at times 0, …, n − 1. We can therefore repeat the above argument; we pull back to the parent of v, then to the parent of that situation and so on. Eventually, the parent that we are considering is the empty situation □; we then finish by computing

$$\displaystyle \begin{aligned} \mathbb{E}\bigl[f(X_{0:n})\bigr] = \sum_{x\in\mathcal{X}} p_\Box(x)\,\mathbb{E}\bigl[f(X_{0:n})\,\big\vert\,X_0=x\bigr]\,, \end{aligned}$$

which is exactly the expected value of f that we started out wanting to compute.
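
The whole scheme is a simple backward recursion over the tree. The sketch below (again with the hypothetical local_model interface from above) computes \(\mathbb {E}[f(X_{0:n})]\) exactly by this repeated ‘pulling back’:

```python
def expected_value(f, local_model, states, n, w=()):
    """E[f(X_{0:n}) | X_{0:|w|-1} = w]: in situations of length n + 1 this
    is just f(w); otherwise it is the p_w-weighted average of the
    expected values in the children of w."""
    if len(w) == n + 1:
        return f(w)
    p_w = local_model(w)
    return sum(p_w[x] * expected_value(f, local_model, states, n, w + (x,))
               for x in states)

# expected number of times the state is 'a' among X_0, X_1, X_2
f = lambda w: sum(1 for x in w if x == 'a')
print(expected_value(f, local_model, ['a', 'b'], n=2))
```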

This method to compute the expected value of a function by ‘pulling back’ the ‘local’, or conditional, expected values, uses the interpretation of a stochastic process as a probability tree. The method relies on a property that is called the law of iterated expectation, or alternatively the law of total probability. It can be stated formally in the measure-theoretic context, where it is also easily stated for continuous-time stochastic processes.

Theorem 5.1

Fix a time-dimension \(\mathbb {T}\in \{\mathbb {N}_{0},\mathbb {R}_{\geq 0}\}\), and let \(\{X_t\}_{t\in \mathbb {T}}\) be a stochastic process on \((\varOmega ,\mathcal {F},P)\). Choose any three ordered sequences \(\mathbf {s} = s_1,\ldots ,s_n\); \(\mathbf {t} = t_1,\ldots ,t_m\) and \(\mathbf {u} = u_1,\ldots ,u_\ell \) in \(\mathbb {T}\), with \(n,m,\ell \in \mathbb {N}\) such that \(s_n < t_1\) and \(t_m < u_1\). Then for any real-valued function \(f\in \mathcal {L}(\mathcal {X}^{n+m+\ell })\) on \(X_{\mathbf {s}},X_{\mathbf {t}},X_{\mathbf {u}}\), it holds that

$$\displaystyle \begin{aligned} \mathbb{E}\bigl[f(X_{\mathbf{s}},X_{\mathbf{t}},X_{\mathbf{u}})\,\big\vert\,X_{\mathbf{s}}\bigr] = \mathbb{E}\Bigl[\mathbb{E}\bigl[f(X_{\mathbf{s}},X_{\mathbf{t}},X_{\mathbf{u}})\,\big\vert\,X_{\mathbf{s}},X_{\mathbf{t}}\bigr]\,\Big\vert\,X_{\mathbf{s}}\Bigr]\,, \end{aligned}$$

whenever \(P(X_{\mathbf {s}})\) and \(P(X_{\mathbf {s}},X_{\mathbf {t}})\) are everywhere strictly positive.

In this result, the final constraint is required to ensure that the conditional expectations are all well-defined in the measure-theoretic sense. This point did not arise in the discussion using probability trees, because there the local (conditional) models are always properly defined by the model specification.

Having discussed how to interpret probability trees and how to use them to reason about the computation of expected values, we now move on to a discussion of their structural properties. Note that the specification of a probability tree is still relatively complicated. This is not really due to the structure of the tree; the situations \(\mathcal {X}^\Diamond \) and prefix order ≺ carry enough information to construct the tree up to any desired level, and their mathematical specification is straightforward. However, in order to specify all the local models \(p_{(\cdot )}\), we need to provide an infinite number of probability mass functions on \(\mathcal {X}\)—one for each situation \(w\in \mathcal {X}^\Diamond \). This is why one often restricts attention to simpler models, where one needs fewer, and often only finitely many, local models.

These simplifications can be seen as a matter of degree. At the one extreme, we have the general definition that we used above, where each situation \(w\in \mathcal {X}^\Diamond \) has its own local model \(p_w\). This leads to a lot of possible structure but is hard to specify. At the other extreme is the independent and identically distributed (i.i.d.) process; this is when we only have a single probability mass function p, and we set \(p_w :\!= p\) for all \(w\in \mathcal {X}^\Diamond \). For such a process, no matter what situation we are in, the next branch will always be chosen according to p. This process is easy to specify, but it does not yield a lot of structure that can capture the dynamics of the underlying system that we are trying to model.

A useful step up from the i.i.d. process is reached by the popular class of models known as homogeneous Markov chains. For a homogeneous Markov chain, the local model only depends on the last step of the corresponding situation, and not on what happened before that:

Definition 5.3 (Homogeneous Markov chain as probability tree)

A probability tree is called a homogeneous Markov chain if \(p_v = p_w\) for all situations \(v,w\in \mathcal {X}^*\) such that \(v_\dagger = w_\dagger \).

Corollary 5.2

Let \((\mathcal {X}^\Diamond ,\prec ,p_{(\cdot )})\) be a homogeneous Markov chain. Then \(p_w = p_x\) for all \(x\in \mathcal {X}\) and all \(w\in \mathcal {X}^*\) such that \(w_\dagger = x\).

Proof

Trivial from Definition 5.3 and the fact that all \(x\in \mathcal {X}\) are also situations. □

An example for the binary state-space \(\mathcal {X}=\{a,b\}\) is shown in Fig. 5.3. Additional degrees of freedom can be introduced back into this model by also letting the local models depend on the corresponding depth of the tree. The dynamics can then depend on the point in time, but not on the specific history up to that time. This yields the more general definition of a (non-homogeneous) Markov chain:

Fig. 5.3

A homogeneous Markov chain, represented as a probability tree

Definition 5.4 (Markov chain as probability tree)

A probability tree is called a Markov chain if \(p_v = p_w\) for all situations \(v,w\in \mathcal {X}^*\) for which \(\lvert v\rvert =\lvert w\rvert \) and \(v_\dagger = w_\dagger \).

An example for the binary state-space \(\mathcal {X}=\{a,b\}\) is shown in Fig. 5.4. It can be verified that a homogeneous Markov chain is a Markov chain, but not—in general—the other way around. Note that, in contrast to homogeneous Markov chains where we only needed to specify local models \(p_x\) for all \(x\in \mathcal {X}\), we now need different local models for each level of the tree. So, we are now back to needing an infinite number of local models in order to fully describe such a model.

Fig. 5.4

A (non-homogeneous) Markov chain, represented as a probability tree as above

These definitions of (homogeneous) Markov chains can also be conveniently translated back to the measure-theoretic context. We here give the general definition, for an arbitrary time-dimension (so, either \(\mathbb {T}=\mathbb {N}_{0}\) or \(\mathbb {T}=\mathbb {R}_{\geq 0}\)) and multiple steps into the future:

Definition 5.5 (Markov chain as probability measure)

A stochastic process \(\{X_t\}_{t\in \mathbb {T}}\) on \((\varOmega ,\mathcal {F},P)\) is called a Markov chain if for all \(s_1,\ldots ,s_n,t\in \mathbb {T}\), \(n\in \mathbb {N}\), such that \(s_1 < \cdots < s_n < t\), it holds that \(P(X_t\,\vert \,X_{s_1},\ldots ,X_{s_n}) = P(X_t\,\vert \,X_{s_n})\). A stochastic process that is a Markov chain is said to have the Markov property.

Similarly, the notion of homogeneity can be defined measure-theoretically and for an arbitrary time-dimension:

Definition 5.6 (Homogeneous Markov chain as probability measure)

A stochastic process \(\{X_t\}_{t\in \mathbb {T}}\) on \((\varOmega ,\mathcal {F},P)\) is called a homogeneous Markov chain if it is a Markov chain, and if additionally, for all \(s,t\in \mathbb {T}\) such that s < t, it holds that \(P(X_t\,\vert \,X_s) = P(X_{t-s}\,\vert \,X_0)\).

We leave it as an exercise to verify that, when \(\mathbb {T}=\mathbb {N}_{0}\), Definitions 5.5 and 5.6 correspond to what we would expect from Definitions 5.4 and 5.3, respectively.

5.2.2 Bayesian Networks

We now move on to a different graphical representation of stochastic processes that is useful for Markov chains in particular: Bayesian networks (BNs), a specific type of probabilistic graphical model. While the graphical structure of probability trees in Sect. 5.2.1 emphasised the partial paths in the realisation of a stochastic process, the BN representation emphasises the individual random variables X t.

The BN representation of a discrete-time Markov chain \(\{X_t\}_{t\in \mathbb {N}_{0}}\) is given in Fig. 5.5. The structure is a directed acyclic graph, with one node associated to each random variable \(X_t\) and arcs representing the dependence of the receiving node’s random variable’s distribution on the originating node’s random variable’s value. Due to the Markov property (cf. Definition 5.5), each random variable \(X_n\), \(n\in \mathbb {N}\), is only (‘directly’) dependent on \(X_{n-1}\), the value of the random variable immediately before it. The initial variable \(X_0\) is somewhat of a special case, since it does not depend on any other variables; there are no time points preceding it. Due to these properties, the graphical structure is that of a chain; this may go some way in explaining the name ‘Markov chain’. In the remainder of this section, we will refer to both a node in the BN and to its random variable, using the notation \(X_t\).

Fig. 5.5

Bayesian network representation of a discrete-time Markov chain \(\{X_t\}_{t\in \mathbb {N}_{0}}\). Nodes represent random variables. An incoming arc on a node represents that the distribution of the corresponding random variable is influenced by the originating node of that arc. Correspondingly, each node associates a probability distribution to its random variable, conditional on the values of the random variables of the nodes on which it is dependent as before

It should be emphasised that the graphical structure is not saying that only nodes which are adjacent in the BN can influence each other. The formal interpretation is as follows: for any node \(X_n\), \(n\in \mathbb {N}\), conditional on the value of the parent(s) of \(X_n\), the distribution of \(X_n\) is probabilistically independent of the non-parents, non-descendants of \(X_n\). This is the general interpretation of the independence properties of the arcs in a BN. In the special case of Markov chains that we are considering here, the interpretation vastly simplifies. Notably, the ‘non-parents, non-descendants’ of any node \(X_n\) are exactly its ‘grandparents’, ‘great-grandparents’ and so on; it is the set of nodes \(\{X_{m}\,:\,m\in \mathbb {N}_{0},\, m<n-1\}\).

Put differently, the value of \(X_n\) influences the distribution of all of its descendants (i.e. the nodes \(X_m\), m > n), so long as we do not know the value of any of those descendants themselves. We will next consider how we can quantify this.

We start by observing that for each node \(X_n\), \(n\in \mathbb {N}\), we have the associated conditional probability \(P(X_n\,\vert \,X_{n-1})\). Since the state-space \(\mathcal {X}\) is taken to be finite, we can conveniently represent these conditional probabilities in a \(\lvert \mathcal {X}\rvert \times \lvert \mathcal {X}\rvert \) matrix. For any \(t\in \mathbb {N}_{0}\), this matrix \(T_t\) is defined, for all \(x,y\in \mathcal {X}\), as

$$\displaystyle \begin{aligned} T_t(x,y) :\!= P(X_{t+1}=y\,\vert\,X_t=x)\,, \end{aligned} $$
(5.3)

where the indexing is taken to be row-first. This matrix \(T_t\) is called the transition matrix of the Markov chain at time t. Its elements \(T_t(x,y)\) are called the transition probabilities from x to y, and they are the probabilities that a system that is in state x at time t will be in state y at time t + 1. This explains the subscript-indexing, whereby the matrix \(T_t\) contains the conditional probabilities associated to node \(X_{t+1}\).
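
In code, such a transition matrix is conveniently represented as a row-stochastic array; the numbers below are a hypothetical example on a ternary state-space, not taken from the chapter.

```python
import numpy as np

# hypothetical transition matrix on X = {a, b, c}; row x holds the
# distribution of X_{t+1} given X_t = x (row-first indexing)
T = np.array([
    [0.7, 0.2, 0.1],  # from a
    [0.3, 0.5, 0.2],  # from b
    [0.0, 0.4, 0.6],  # from c
])
assert np.allclose(T.sum(axis=1), 1.0)  # every row is a pmf on X
```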

These transition matrices make it easy to connect back to the probability tree representation of Markov chains that we encountered earlier:

Proposition 5.1

Let \((\mathcal {X}^\Diamond ,\prec ,p_{(\cdot )})\) be a probability tree that is a Markov chain, and let \(T_t\) denote the associated family of transition matrices, as defined above. Then for all \(t\in \mathbb {N}\) and all \(w\in \mathcal {X}^*\) such that \(\lvert w\rvert = t\), it holds that \(p_w(y) = T_{t-1}(w_\dagger ,y)\) for all \(y\in \mathcal {X}\).

Proof

Use Eq. (5.2), Definition 5.5 and Eq. (5.3). □

The reason that we represent these probabilities using matrices is that this opens up the entire toolbox of linear algebra. We will see that this allows us to very succinctly write down certain relations and properties. For instance, we can now write the influence of a node on its descendants, using a simple matrix product:

Proposition 5.2

Let \(\{X_t\}_{t\in \mathbb {N}_{0}}\) be a discrete-time Markov chain, and let \(T_t\) be the associated family of transition matrices, as defined above. Then for all \(s,t\in \mathbb {N}_{0}\) such that s ≤ t, and all \(x,y\in \mathcal {X}\), it holds that \(P(X_{t+1}=y\,\vert \,X_s=x) = \left [T_s\cdots T_{t}\right ](x,y)\).

Proof

We give a proof by induction. For t = s the result is immediate from the definition of the transition matrix T s. Now suppose the result is true for t − 1; we show that it is also true for t:

$$\displaystyle \begin{aligned} P(X_{t+1}=y\,\vert\,X_s=x) &= \sum_{z\in\mathcal{X}} P(X_{t+1}=y,X_{t}=z\,\vert\,X_s=x) \\ &= \sum_{z\in\mathcal{X}} P(X_{t}=z\,\vert\,X_s=x) P(X_{t+1}=y\,\vert\,X_{t}=z,\,X_s=x) \\ &= \sum_{z\in\mathcal{X}} \left[T_s\cdots T_{t-1}\right](x,z) P(X_{t+1}=y\,\vert\,X_{t}=z) \\ &= \sum_{z\in\mathcal{X}} \left[T_s\cdots T_{t-1}\right](x,z) T_{t}(z,y) = \bigl[T_s\cdots T_{t-1}T_{t}\bigr](x,y)\,, \end{aligned} $$

where the first and second equalities are basic properties of probabilities, the third equality is due to the induction hypothesis and the Markov property (c.f. Definition 5.5), the fourth equality uses the definition of the transition matrix T t and the final equality uses the definition of a matrix product. □

Another useful property of this representation is that it allows us to write conditional expectations of functions \(f\in \mathcal {L}(\mathcal {X})\) using matrix-vector products. In particular, again because \(\mathcal {X}\) is finite, any \(f\in \mathcal {L}(\mathcal {X})\) can be interpreted as a vector in \(\mathbb {R}^{\lvert \mathcal {X}\rvert }\); the coordinates are simply the values f(x), \(x\in \mathcal {X}\). Hence:

Proposition 5.3

Let \(\{X_t\}_{t\in \mathbb {N}_{0}}\) be a discrete-time Markov chain, and let T t be the associated family of transition matrices. Then, for all \(f\in \mathcal {L}(\mathcal {X})\) , all \(t\in \mathbb {N}_{0}\) and all \(x\in \mathcal {X}\) , it holds that \(\mathbb {E}\bigl [f(X_{t+1})\,\vert \,X_t=x\bigr ] = \left [T_tf\right ](x)\).

Proof

Simply use the definition of the matrix-vector product:

$$\displaystyle \begin{aligned} \left[T_tf\right](x) = \sum_{y\in\mathcal{X}}T_t(x,y)f(y) = \sum_{y\in\mathcal{X}}P(X_{t+1}=y\,\vert\,X_t=x)f(y) = \mathbb{E}\bigl[f(X_{t+1})\,\vert\,X_t=x\bigr]\,. \end{aligned}$$

□

The above properties can be combined to give a simplified version of the law of iterated expectation (Theorem 5.1) that we encountered in Sect. 5.2.1:

Corollary 5.3

Let \(\{X_t\}_{t\in \mathbb {N}_{0}}\) be a discrete-time Markov chain, and let \(T_t\) be the associated family of transition matrices. Then, for all \(f\in \mathcal {L}(\mathcal {X})\), all \(s,t\in \mathbb {N}_{0}\) such that s ≤ t and all \(x\in \mathcal {X}\), it holds that \(\mathbb {E}\bigl [f(X_{t+1})\,\vert \,X_s=x\bigr ] = \left [T_s\cdots T_tf\right ](x)\).

Proof

Immediate from Propositions 5.2 and 5.3. □

Note that, where the law of iterated expectation in Theorem 5.1 could be interpreted as ‘pulling back’ in the associated probability tree, the above simplified version can additionally be interpreted as ‘pulling back’ the conditional expectations in the associated BN, through the product of the transition matrices. This is graphically represented in Fig. 5.6.
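
In this representation, Corollary 5.3 is nothing but a loop of matrix-vector products. A minimal sketch, reusing the hypothetical numpy matrix T from above and assuming a homogeneous chain, so that \(T_t=T\) for all t:

```python
f = np.array([1.0, 0.0, 0.0])  # f as a vector: the indicator of state a

g = f
for _ in range(5):  # five 'pull back' steps: g becomes T^5 f
    g = T @ g
print(g)  # g[x] = P(X_{s+5} = a | X_s = x), by Corollary 5.3
```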

Fig. 5.6

Graphical representation of the ‘pulling back’ interpretation of the simplified version of the law of iterated expectation in Corollary 5.3. The function f, of which we want to compute the expectation on \(X_{t+1}\), given \(X_s\), starts at node \(X_{t+1}\), where its value is trivial. The function is then ‘pulled back’ to the parent \(X_t\) of \(X_{t+1}\), by taking the local expectation, by left-multiplying with \(T_t\). This new function \(T_tf\) on \(X_t\) is then ‘pulled back’ by multiplying with \(T_{t-1}\) and so forth. Eventually, the function \(T_{s+1}\cdots T_tf\) is pulled into \(X_s\), by left-multiplying with \(T_s\). The resulting function on \(X_s\) is the conditional expectation of interest as before

5.2.3 Transition Graphs

We now move on to yet another graphical representation: the transition graph of a homogeneous (discrete-time) Markov chain. We start by noticing the following:

Proposition 5.4

Let \(\{X_t\}_{t\in \mathbb {N}_{0}}\) be a discrete-time homogeneous Markov chain, and let \(T_t\) be the associated family of transition matrices. Then there is a unique matrix T such that \(T_t = T\) for all \(t\in \mathbb {N}_{0}\).

Proof

The matrix of interest can be identified as \(T :\!= T_0\). Now, using the definition of a homogeneous Markov chain (Definition 5.6) and the transition matrix \(T_t\) for any \(t\in \mathbb {N}_{0}\), it holds for all \(x,y\in \mathcal {X}\) that

$$\displaystyle \begin{aligned} T(x,y) = T_0(x,y) &= P(X_1=y\,\vert\,X_0=x) \\ &= P(X_{(t+1)-t}=y\,\vert\,X_0=x)=P(X_{t+1}=y\,\vert\,X_t=x)=T_t(x,y)\,, \end{aligned} $$

which concludes the proof; uniqueness is trivial. □

As an aside, note therefore that a discrete-time homogeneous Markov chain can be characterised (up to the initial distribution \(P(X_0)\)) by a single transition matrix T. In particular, this T can be seen as the canonical parameter of the Markov chain. This relative ease of parameterisation—compared to say an arbitrary stochastic process, which needs separate parameters for every possible history—is arguably one of the reasons that make homogeneous Markov chains such convenient and widely used models.

Moving on, the transition graph of a discrete-time homogeneous Markov chain is a graphical representation of its associated transition matrix T. In this way, this representation emphasises the interactions between the states, rather than the random variables. An example transition graph is shown in Fig. 5.7. The formal definition is as follows:

Fig. 5.7

Example transition graph for a discrete-time homogeneous Markov chain with a ternary state-space \(\mathcal {X}=\{a,b,c\}\). The transition graph is a directed graph, with a vertex for each state and an arc from the vertex of x to that of y, with \(x,y\in \mathcal {X}\), whenever T(x, y) = P(X 1 = y | X 0 = x) > 0. The arcs are labelled with the corresponding transition probabilities. The figure uses the shorthand notation P(y|x) for the elements T(x, y) of T as before

Definition 5.7 (Transition graph)

Let \(\{X_t\}_{t\in \mathbb {N}_{0}}\) be a discrete-time homogeneous Markov chain, and let T be its associated transition matrix. Then its associated transition graph is a directed graph (V, E) with one vertex for each state, \(V=\mathcal {X}\), and, for all \(x,y\in \mathcal {X}\), an arc (x, y) ∈ E whenever T(x, y) > 0.

One of the reasons transition graphs are sometimes useful is that they allow one to study which parts of a system can be reached from other parts of the system. The simplest application is that of communicating states:

Definition 5.8 (Communicating states)

Let \(\{X_t\}_{t\in \mathbb {N}_{0}}\) be a discrete-time homogeneous Markov chain, and let T be its associated transition matrix. For any two states \(x,y\in \mathcal {X}\), y is said to be accessible from x if there is some \(n\in \mathbb {N}\) such that \(T^n(x,y) > 0\). Furthermore, x and y are said to communicate if y is accessible from x, and x is accessible from y.

Note that in the above, \(T^n\) denotes the n-th matrix power of T (cf. Proposition 5.2). This has an intuitive graphical interpretation:

Corollary 5.4

Let \(\{X_t\}_{t\in \mathbb {N}_{0}}\) be a discrete-time homogeneous Markov chain. Then for any \(x,y\in \mathcal {X}\) , y is accessible from x if and only if there is a path from x to y in the associated transition graph. Furthermore, x and y communicate if and only if there is a cycle in the associated transition graph that contains both x and y.

Proof

Trivial from Definitions 5.7 and 5.8. □

Inspection of the transition graph in Fig. 5.7 shows that, in that example, all states communicate with each other. When this is the case, i.e. when all states communicate, the Markov chain is said to be irreducible. A maximal set of states that all communicate with each other is called a communication class. Hence, an irreducible Markov chain has only a single communication class, which is equal to \(\mathcal {X}\).
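
Accessibility and communication are easy to check numerically; since any state that is accessible at all is accessible in at most \(\lvert \mathcal {X}\rvert \) steps, it suffices to inspect the first \(\lvert \mathcal {X}\rvert \) matrix powers. A small sketch, reusing the hypothetical matrix T from before:

```python
def accessible(T, x, y):
    """Is y accessible from x, i.e. T^n(x, y) > 0 for some n in 1, ..., k?"""
    k = T.shape[0]
    Tn = np.eye(k)
    for _ in range(k):
        Tn = Tn @ T
        if Tn[x, y] > 0:
            return True
    return False

def communicate(T, x, y):
    return accessible(T, x, y) and accessible(T, y, x)

k = T.shape[0]  # irreducible: all pairs of states communicate
print(all(communicate(T, x, y) for x in range(k) for y in range(k)))  # True
```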

Note that not every Markov chain is irreducible; in general there may be more than one communication class. An example is given in Fig. 5.8. When a communication class \(\mathcal {A}\subset \mathcal {X}\) is accessible from a different communication class \(\mathcal {B}\subset \mathcal {X}\), then \(\mathcal {A}\) is said to dominate \(\mathcal {B}\). A communication class which is not dominated is called maximal. When a Markov chain has only a single maximal communication class, this is called the top (communication) class.

Fig. 5.8

Transition graph of a Markov chain that is not irreducible. It has two communication classes, {a, b} and {c, d}. The set {c, d} dominates {a, b} and is the top (communication) class of the Markov chain. This Markov chain is top class regular

Investigation of the communicating states in a Markov chain is often useful when one is interested in the long-term behaviour of the system. After all, while a system might begin in one state, it need not necessarily always eventually return to that state; this is the property that is illustrated in Fig. 5.8.

An important concept is that of the regularity of the communication classes of a Markov chain. A communication class is regular if there is a number \(n\in \mathbb {N}\) such that it is possible to go from any state in the class to any other state in the class, in exactly n steps. Of particular importance is the notion of top class regularity:

Definition 5.9 (Top class regularity)

Let \(\{X_t\}_{t\in \mathbb {N}_{0}}\) be a discrete-time homogeneous Markov chain, and let T be its associated transition matrix. Then the Markov chain is said to be top class regular if

$$\displaystyle \begin{aligned} \bigl\{ y\in\mathcal{X}\,:\,(\exists n\in\mathbb{N})(\forall x\in\mathcal{X})\, T^n(x,y)>0 \bigr\} \neq \emptyset\,, \end{aligned}$$

and in that case the top class \(\mathcal {X}_{\mathrm {top}}\) of the Markov chain exists and is equal to this set. When furthermore \(\mathcal {X}_{\mathrm {top}}=\mathcal {X}\), the Markov chain itself is said to be regular.

The reason that this property is so important is that it provides a sufficient condition for the long-term behaviour of a Markov chain to converge to a stationary distribution, regardless of the state in which it started:

Theorem 5.2

Let \(\{X_t\}_{t\in \mathbb {N}_{0}}\) be a discrete-time homogeneous Markov chain, and let T be its associated transition matrix. Let this Markov chain be regular. Then there is a probability mass function \(P_\infty :\mathcal {X}\to \mathbb {R}_{\geq 0}\) such that, for all \(x,y\in \mathcal {X}\),

$$\displaystyle \begin{aligned} P_\infty(y) = \lim_{n\to+\infty}T^n(x,y)\,. \end{aligned}$$
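
Numerically, Theorem 5.2 can be illustrated by simply taking high matrix powers. The hypothetical matrix T from before is regular (all entries of \(T^2\) are strictly positive), so its powers converge to a matrix with identical rows:

```python
P_inf = np.linalg.matrix_power(T, 100)
print(P_inf)     # all rows (approximately) coincide
print(P_inf[0])  # the limit distribution, independent of the initial state
```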

5.3 Imprecise Discrete-Time Markov Chains

We will now move on to the discussion surrounding imprecise (discrete-time) Markov chains (IDTMCs). So, we still consider the time-dimension \(\mathbb {T}=\mathbb {N}_{0}\). We will generalise each of the representations that we previously encountered to this new setting, where we roughly follow the same order as in Sect. 5.2.

So, let us start with the ‘measure-theoretic’ representation of imprecise stochastic processes. In this setting, we consider a set \(\mathbb {P}\) of probability measures on the measurable space of paths \((\varOmega ,\mathcal {F})\). Then for each \(P\in \mathbb {P}\), we have a probability space \((\varOmega ,\mathcal {F},P)\), to which we can associate the precise stochastic process \(\{X_t\}_{t\in \mathbb {N}_{0}}\) as in Definition 5.1. For any function \(f\in \mathcal {L}(\mathcal {X}^n)\), \(n\in \mathbb {N}\), we can express the expected value on the n time points \(\mathbf {t}\subset \mathbb {N}_{0}\) as \(\mathbb {E}_P[f(X_{\mathbf {t}})]\) as in Sect. 5.2. Recall from Chap. 2 that in this imprecise probabilistic context, we are more generally interested in the lower and upper expectation of f, which are defined, respectively, as

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_{\mathbf{t}})\bigr] :\!= \inf_{P\in\mathbb{P}} \mathbb{E}_P\bigl[f(X_{\mathbf{t}})\bigr]\quad \text{and}\quad \overline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_{\mathbf{t}})\bigr] :\!= \sup_{P\in\mathbb{P}} \mathbb{E}_P\bigl[f(X_{\mathbf{t}})\bigr]\,. \end{aligned}$$

We briefly recall the well-known conjugacy relation \(\overline {\mathbb {E}}_{\mathbb {P}}\bigl [f(X_{\mathbf {t}})\bigr ]=- \underline {\mathbb {E}}_{\mathbb {P}}\bigl [-f(X_{\mathbf {t}})\bigr ]\), from which it follows that we can present the remainder of this discussion entirely in terms of lower expectations; any corresponding results on upper expectations follow directly through this relation.
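
When the set \(\mathbb {P}\) is given by finitely many measures (for instance, the extreme points of a credal set), these envelopes reduce to a plain minimum and maximum. A minimal self-contained sketch, with the same brute-force expectation as in the sketch of Sect. 5.2 and hypothetical toy inputs:

```python
import itertools

def expectation(f, joint, states, n):
    """E_P[f(X_t)] by brute force, as in the sketch in Sect. 5.2."""
    return sum(f(x) * joint[x] for x in itertools.product(states, repeat=n))

def lower_expectation(f, joints, states, n):
    """Lower envelope: the inf over P in PP, here a min over a finite set."""
    return min(expectation(f, joint, states, n) for joint in joints)

def upper_expectation(f, joints, states, n):
    """Conjugacy: the upper expectation of f is - lower expectation of -f."""
    return -lower_expectation(lambda x: -f(x), joints, states, n)

# two hypothetical joint pmfs for (X_{t_1}, X_{t_2}) on states {0, 1}
joints = [
    {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3},
    {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25},
]
f = lambda x: sum(x)  # number of time points in state 1
print(lower_expectation(f, joints, [0, 1], 2))  # 0.9
print(upper_expectation(f, joints, [0, 1], 2))  # 1.0
```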

Slightly more generally than the above, we will focus on conditional lower expectations. Similar to the precise case that we discussed before, these are defined for any \(f\in \mathcal {L}(\mathcal {X}^{n+m})\), \(n,m\in \mathbb {N}\), any \(\mathbf {s},\mathbf {t}\subset \mathbb {N}_{0}\) such that s and t are of length n and m, respectively, and any \(x_{\mathbf {s}}\in \mathcal {X}^n\), as

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_{\mathbf{s}},X_{\mathbf{t}})\,\big\vert\,X_{\mathbf{s}}=x_{\mathbf{s}}\bigr] :\!= \inf_{P\in\mathbb{P}} \mathbb{E}_P \bigl[f(X_{\mathbf{s}},X_{\mathbf{t}})\,\big\vert\,X_{\mathbf{s}}=x_{\mathbf{s}}\bigr]\,, \end{aligned}$$

whenever \( \underline {\mathbb {E}}_{\mathbb {P}}[\mathbb {I}_{x_{\mathbf {s}}}(X_{\mathbf {s}})]>0\). In this last condition, \(\mathbb {I}_{x_{\mathbf {s}}}\) is the indicator of x s; for all \(y_{\mathbf {s}}\in \mathcal {X}^n\), \(\mathbb {I}_{x_{\mathbf {s}}}(y_{\mathbf {s}}):\!= 1\) if x s = y s and \(\mathbb {I}_{x_{\mathbf {s}}}(y_{\mathbf {s}}):\!= 0\), otherwise. Note that then

$$\displaystyle \begin{aligned} 0 < \underline{\mathbb{E}}_{\mathbb{P}}[\mathbb{I}_{x_{\mathbf{s}}}(X_{\mathbf{s}})] = \inf_{P\in\mathbb{P}} \mathbb{E}_P[\mathbb{I}_{x_{\mathbf{s}}}(X_{\mathbf{s}})] = \inf_{P\in\mathbb{P}} P(X_{\mathbf{s}}=x_{\mathbf{s}})\,, \end{aligned}$$

so this condition guarantees that the conditional expectations are well-defined for all the precise measures \(P\in \mathbb {P}\). As before, there are formalisms where this condition is not strictly required—see, for example, the discussion around the local models of probability trees—or where it can be weakened. For simplicity, we keep the condition here to ensure that everything remains well-defined also under the measure-theoretic interpretation.

We are now ready to give the formal definition of an imprecise discrete-time Markov chain (IDTMC):

Definition 5.10 (IDTMC as set of processes)

An imprecise discrete-time Markov chain is a set \(\mathbb {P}\) of probability measures on the measurable space \((\varOmega ,\mathcal {F})\), with associated lower expectation operator \( \underline {\mathbb {E}}_{\mathbb {P}}\) as defined above, such that, for all \(f\in \mathcal {L}(\mathcal {X})\) and all \(s_1,\ldots ,s_n,t\in \mathbb {N}_{0}\) such that \(s_1 < \cdots < s_n < t\),

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_t)\,\big\vert\, X_{s_1},\ldots,X_{s_n}\bigr] = \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_t)\,\big\vert\, X_{s_n}\bigr]\,. \end{aligned}$$

Furthermore, an imprecise discrete-time Markov chain is called homogeneous if, for all \(s,t\in \mathbb {N}_{0}\), s < t, and all \(f\in \mathcal {L}(\mathcal {X})\), it holds that \( \underline {\mathbb {E}}_{\mathbb {P}}[f(X_t)\,\vert \,X_s]= \underline {\mathbb {E}}_{\mathbb {P}}[f(X_{t-s})\,\vert \,X_0]\).

Let us compare this with Definition 5.5, the measure-theoretic definition of a precise Markov chain. The first difference is that the imprecise definition above is phrased in terms of (lower) expectations, whereas the precise definition used probabilities. We recall that this is because, in the framework of imprecise probability, it does not suffice to state results in terms of (lower) probabilities; instead the more general language of (lower) expectation operators is required.

Nevertheless, this definition implies that, in terms of lower probabilities,

$$\displaystyle \begin{aligned} \inf_{P\in\mathbb{P}} P(X_t=x\,\vert\,X_{s_1},\ldots,X_{s_n}) &= \underline{\mathbb{E}}_{\mathbb{P}}\bigl[\mathbb{I}_{x}(X_t)\,\big\vert\, X_{s_1},\ldots,X_{s_n}\bigr] \\ &= \underline{\mathbb{E}}_{\mathbb{P}}\bigl[\mathbb{I}_{x}(X_t)\,\big\vert\, X_{s_n}\bigr] = \inf_{P\in\mathbb{P}} P(X_t=x\,\vert\,X_{s_n})\,, \end{aligned} $$

which displays this imprecise Markov condition in more familiar terms.

One may wonder at this point whether an imprecise Markov chain \(\mathbb {P}\) is itself a set of Markov chains; the answer to this question is a resounding no (or at least, not necessarily). This point deserves the strongest possible emphasis:

An element of an imprecise Markov chain \(\mathbb {P}\) need not be a Markov chain! So, in general \(P(X_t\,\vert \,X_{s_1},\ldots ,X_{s_n})\neq P(X_t\,\vert \,X_{s_n})\) for \(P\in \mathbb {P}\), with \(s_1 < \cdots < s_n < t\) in \(\mathbb {N}_{0}\).

To clarify, the ‘imprecise Markov condition’ of an imprecise Markov chain is an ‘independence’ assessment about the lower envelope only. Formally, it is an assessment of epistemic irrelevance—a specific type of independence that arises in imprecise probability theory—which is weaker than strong independence, a different type of independence that would hold if all \(P\in \mathbb {P}\) were Markov chains.

In a similar vein, the notion of homogeneity is here only enforced on the lower envelope. So, for an IDTMC \(\mathbb {P}\) that is homogeneous, there may be processes \(P\in \mathbb {P}\) that are neither Markov nor homogeneous.

The reason why we stress this so strongly is twofold. First of all, it implies that the structural assumptions of an imprecise Markov chain are in fact much weaker than those of a precise Markov chain—we no longer assume that future events are fully independent of the history, given the current state, or that their distribution is independent of the point in time. They might be, of course—there are elements \(P\in \mathbb {P}\) that satisfy those properties—but it is not strictly enforced. In other words, this model also represents ‘higher-order’ uncertainty about the structural properties of the system that we are trying to model.

The second reason is that this property is central to all the efficient computational methods that have been developed for working with imprecise Markov chains. We will next illustrate this point by moving the discussion to the representation of IDTMCs as imprecise probability trees.

5.3.1 Imprecise Probability Trees

Recall that for precise probability trees, we associate with each situation a local model p w, which is a probability mass function on \(\mathcal {X}\). In contrast, in order to define imprecise probability trees, we will consider imprecise local models. Such an imprecise local model \(\mathcal {P}_w\) is simply a set of probability mass functions on \(\mathcal {X}\). This leads to the following definition:

Definition 5.11 (Imprecise probability tree)

An imprecise probability tree is a tuple \((\mathcal {X}^\Diamond ,\prec ,\mathcal {P}_{(\cdot )})\), where \((\mathcal {X}^\Diamond ,\prec )\) is an event tree and \(\mathcal {P}_{(\cdot )}\) is a set-valued function such that, for all \(w\in \mathcal {X}^\Diamond \), \(\mathcal {P}_w\) is a non-empty set of probability mass functions on \(\mathcal {X}\).

An obvious question is how one should interpret such imprecise probability trees. As a first step, we consider the (precise) probability trees that are compatible with a given imprecise probability tree:

Definition 5.12

Let \((\mathcal {X}^*,\mathcal {P}_{(\cdot )})\) be an imprecise probability tree. Then a (precise) probability tree \((\mathcal {X}^*,p_{(\cdot )})\) is called compatible with this imprecise probability tree, if \(p_w\in \mathcal {P}_w\) for all \(w\in \mathcal {X}^*\).

This immediately lets us connect back to the sets-of-measures that we discussed before. Specifically, consider an imprecise probability tree \((\mathcal {X}^*,\mathcal {P}_{(\cdot )})\), and suppose the precise tree \((\mathcal {X}^*,p_{(\cdot )})\) is compatible with it. Then, using the method outlined in Sect. 5.2.1, we can associate a (precise) measure P to this precise tree. Collecting in the set \(\mathbb {P}\) all the associated measures of all precise trees that are compatible with the imprecise tree, we obtain a set representation as in Sect. 5.3.

The connection in the other direction is analogous but a bit more subtle. In particular, if we start from an IDTMC \(\mathbb {P}\), then each \(P\in \mathbb {P}\) induces a precise probability tree. Using the local models of this tree, we can construct set-valued local models by simply varying P over \(\mathbb {P}\). These set-valued local models can then be used to construct an imprecise probability tree. Clearly, there are then precise trees that are compatible with this imprecise tree, and each such precise tree induces a precise measure P′. However, and this is the crucial observation, it is in general not guaranteed that such P′ are included in \(\mathbb {P}\)!

As a simple example, suppose that \(\mathcal {X}=\{a,b\}\) and we start with a set \(\mathbb {P}\) containing only two i.i.d. processes, whose local models are given by p and h, respectively. Then, the induced imprecise probability tree has local models \(\mathcal {P}_w=\{p,h\}\) for all \(w\in \mathcal {X}^*\). On the other hand, we can easily construct a non-i.i.d. process such that, for all \(w\in \mathcal {X}^*\), its local model is \(p_w=p\) if \(w_\top =a\), and \(p_w=h\) otherwise. Then clearly this process was not in the original set \(\mathbb {P}\), but it is compatible with the imprecise probability tree.

To prevent this from happening, we will require that the set representation \(\mathbb {P}\) of the IDTMC is ‘large enough’. Specifically, what we need is that it is already closed under such ‘recombination’ of local models at different points in time. Whenever this property holds, we will say that the IDTMC is separately specified. Clearly, when we start from an imprecise probability tree and construct its set of compatible processes, the resulting IDTMC will satisfy this property. In the remainder of this section, we will assume that a given set \(\mathbb {P}\) is indeed separately specified. Further on, when we consider the parametrisation of an IDTMC, we will give an easy condition that ensures this property holds.

With this connection between the two representations in place, we can again start to consider computational methods for lower expectations. Analogous to what we have seen before, in this context we have a law of iterated lower expectation that we can use as a computational tool. The imprecise probability tree representation again provides graphical intuition.

Similar to the exposition in Sect. 5.2.1, we start with a function \(f\in \mathcal {L}(\mathcal {X}^{n+1})\) of which we want to compute the lower expectation with respect to the states at the time points 0, …, n. Then for any situation \(w\in \mathcal {X}^*\) such that \(\lvert w\rvert = n+1\), the lower expectation is trivial:

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\left[f(X_{0:n})\,\big\vert\,X_{0:n}=w\right] = f(w)\,. \end{aligned}$$

We then again ‘pull back’ to the parent situation v of w; this is where the main difference with Sect. 5.2.1 occurs. Notably, we here have an imprecise local model \(\mathcal {P}_v\) associated with this node v. The point of the law of iterated lower expectation is that it suffices to only compute the associated conditional lower expectation locally:

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\left[f(X_{0:n})\,\big\vert\,X_{0:(n{-}1)}{=}v\right]{=}\inf_{p_v\in\mathcal{P}_v} \sum_{x\in\mathcal{X}} p_v(x) \underline{\mathbb{E}}_{\mathbb{P}}\left[f(X_{0:n})\,\big\vert\,X_{0:(n{-}1)}{=}v,X_n{=}x\right]\,. \end{aligned}$$

Exactly analogous to the precise case, by repeatedly pulling back until we reach the root of the tree, we eventually compute \( \underline {\mathbb {E}}_{\mathbb {P}}\bigl [f(X_{0:n})\bigr ]\), which is the lower expectation of interest.
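To make this recursion concrete, the following Python sketch runs the backward computation on a small imprecise probability tree. The binary state space, the two-element local credal sets and the function f are hypothetical choices of ours; since each local model is a finite set of mass functions, the local infimum is attained and can be found by enumeration.

```python
import numpy as np

# Backward recursion ('pulling back') on an imprecise probability tree.
# All numbers below are hypothetical illustrations.

states = ('a', 'b')

def local_model(w):
    """Imprecise local model P_w for situation w (a tuple of states).
    Here the same two-element credal set is used in every situation."""
    return [np.array([0.7, 0.3]), np.array([0.4, 0.6])]

def lower_exp(f, w, n):
    """Lower expectation of f(X_{0:n}), conditional on the situation w."""
    if len(w) == n + 1:  # leaf situation: the value of f is determined
        return f(w)
    # law of iterated lower expectation: optimise over the local model only
    children = np.array([lower_exp(f, w + (x,), n) for x in states])
    return min(float(p @ children) for p in local_model(w))

# number of visits to state 'a' over the time points 0, 1 and 2,
# conditional on the situation X_0 = 'a'
f = lambda w: float(w.count('a'))
print(lower_exp(f, ('a',), 2))  # 1.8 for these local models
```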

As before, the need to specify these (imprecise) local models \(\mathcal {P}_w\) for all situations makes such a model difficult to work with. This is simplified for imprecise Markov chains; note that we here assume the analogue of homogeneity to hold implicitly:

Definition 5.13 (Homogeneous IDTMC as imprecise probability tree)

An imprecise probability tree is called an imprecise homogeneous discrete-time Markov chain if \(\mathcal {P}_v=\mathcal {P}_w\) for all \(v,w\in \mathcal {X}^*\) for which \(v_\top =w_\top \).

Corollary 5.5

Let \((\mathcal {X}^*,\mathcal {P}_{(\cdot )})\) be a homogeneous IDTMC. Then \(\mathcal {P}_w=\mathcal {P}_x\) for all \(x\in \mathcal {X}\) and all \(w\in \mathcal {X}^*\) such that \(w_\top =x\).

Proof

Trivial from Definition 5.13 and the fact that all \(x\in \mathcal {X}\) are also situations. □

As above, an IDTMC has a set of compatible precise probability trees, each of which induces a measure P, and these are collected in the set \(\mathbb {P}\), which is the measure-theoretic IDTMC representation from Definition 5.10. Observe that a precise probability tree does not have to be a (homogeneous) Markov chain for it to be compatible with a given IDTMC! That is, to be compatible, each local model \(p_w\), \(w\in \mathcal {X}^*\), should be in the set \(\mathcal {P}_{w_\top }\), and this set depends only on the most recent state \(w_\top \) of the situation w. But, for a different situation v such that \(v_\top =w_\top \), while we do require that \(p_v\in \mathcal {P}_{v_\top }=\mathcal {P}_{w_\top }\), we do not require that \(p_v=p_w\)!

We will next illustrate that the law of iterated lower expectation simplifies further for imprecise Markov chains. We do this again by considering the imprecise counterpart of Bayesian networks.

5.3.2 Credal Networks

We here consider the graphical representation of imprecise Markov chains as credal networks. This is the imprecise generalisation of the Bayesian network representation that we encountered in Sect. 5.2.2. The graphical structure is as before, with the notable differences being (i) the local models (which are here replaced with imprecise local models) and (ii) the interpretation of the independence properties induced by the arcs. Regarding the second point, it suffices for our present purpose to note that we interpret the structure as a credal network under epistemic irrelevance. This then has the same consequence as that stated in the beginning of Sect. 5.3: given the value of the parent of a node \(X_t\), \(t\in \mathbb {N}_{0}\), the lower expectation of any function dependent on \(X_t\) does not depend on the values of the non-parents, non-descendants (again, grandparents and so on) of \(X_t\). For reference, the graphical representation is drawn in Fig. 5.9.

Fig. 5.9

Credal network representation of an imprecise discrete-time Markov chain. An incoming arc on a node represents that the local uncertainty model of the corresponding variable is influenced by the originating node of that arc. Correspondingly, each node associates an imprecise probability model to its variable, conditional on the values of the random variables of the nodes on which it is dependent

The interpretation in terms of sets of distributions is as would be expected; the model induces a set \(\mathbb {P}\), each element \(P\in \mathbb {P}\) of which satisfies \(P(X_n\,\vert \,X_{n-1})\in \mathcal {P}(X_n\,\vert \,X_{n-1})\) for all \(n\in \mathbb {N}\), and \(P(X_0)\in \mathcal {P}(X_0)\). As before, the independence assumptions are not necessarily required to hold for these compatible precise models. Conversely, if we are given an IDTMC \(\mathbb {P}\), then the local models \(\mathcal {P}(X_n\,\vert \,X_{n-1})\) of the credal network are constructed by restricting attention to the conditional distributions \(P(X_n\,\vert \,X_{n-1})\) and varying P over \(\mathbb {P}\).

Similar to the discussion around the interpretation of imprecise probability trees, we here also need some ‘closedness’ assumptions to ensure this duality of representations holds. Specifically, we again require that \(\mathbb {P}\) is separately specified. Furthermore, it is assumed that the local models \(\mathcal {P}(X_n\,\vert \,X_{n-1})\) of the credal network have separately specified rows. This means that these local models are not arbitrary sets of conditional probabilities. If we let \(\mathcal {P}(X_n\,\vert \,X_{n-1}=x):\!= \bigl \{P(X_n\,\vert \,X_{n-1}=x)\in \mathcal {P}(X_n\,\vert \,X_{n-1})\bigr \}\) for all \(x\in \mathcal {X}\), then what we require is that

$$\displaystyle \begin{aligned} \mathcal{P}(X_n\,\vert\,X_{n-1})=\times_{x\in\mathcal{X}} \mathcal{P}(X_n\,\vert\,X_{n-1}=x)\,. \end{aligned} $$
(5.4)

Under these conditions, we can straightforwardly switch between representations.
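As a small illustration of Eq. (5.4), the following Python sketch builds a local model with separately specified rows as the Cartesian product of its row sets; the two-state space and the particular row sets are hypothetical choices of ours. Stacking the selected rows already yields the transition matrices that will be collected in the set \(\mathcal {T}_t\) below.

```python
import itertools
import numpy as np

# Hypothetical row sets P(X_n | X_{n-1} = x) on the state space {a, b};
# each row set here contains just two mass functions.
rows = {
    'a': [np.array([0.7, 0.3]), np.array([0.5, 0.5])],
    'b': [np.array([0.2, 0.8]), np.array([0.4, 0.6])],
}

# Eq. (5.4): the local model is the Cartesian product of the row sets, so
# every combination of one row per state is an element of the local model.
T_set = [np.vstack(choice) for choice in itertools.product(rows['a'], rows['b'])]
print(len(T_set))  # 2 * 2 = 4 induced transition matrices
```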

We next generalise the exposition in Sect. 5.2.2 regarding the associated transition matrices. To this end, fix any \(t\in \mathbb {N}_{0}\). Then, as in the precise case, each element \(P(X_{t+1}\,\vert \,X_t)\in \mathcal {P}(X_{t+1}\,\vert \,X_t)\) induces a transition matrix \(T_t\). So, let us now consider the set \(\mathcal {T}_t\) of transition matrices that is induced by the imprecise local models:

$$\displaystyle \begin{aligned} \mathcal{T}_t :\!= \Big\{ T_t\,&:\, \big(\forall x,y\in\mathcal{X}\,:\, T_t(x,y)=P(X_{t+1}=y\,\vert\,X_t=x)\big), \\ &\quad P(X_{t+1}\,\vert\,X_t)\in\mathcal{P}(X_{t+1}\,\vert\,X_t) \Big\}\,. \end{aligned} $$

A key insight is that we can use this set of transition matrices to define a convenient computational tool for lower expectations:

Definition 5.14

Let \(\mathbb {P}\) be an IDTMC, and let \(\mathcal {T}_t\) be the associated family of sets of transition matrices, as defined above. Then, for each \(t\in \mathbb {N}_{0}\), the associated lower transition operator \( \underline {T}_t:\mathcal {L}(\mathcal {X})\to \mathcal {L}(\mathcal {X})\) is defined, for all \(f\in \mathcal {L}(\mathcal {X})\) and all \(x\in \mathcal {X}\), as

$$\displaystyle \begin{aligned} \bigl[\underline{T}_tf\bigr](x) :\!= \inf_{T_t\in\mathcal{T}_t}\bigl[T_tf\bigr](x)\,. \end{aligned}$$

This lower transition operator essentially fulfils the same role as the transition matrices from which it is derived. In particular, we have the following:

Proposition 5.5

Let \(\mathbb {P}\) be an IDTMC, and let \( \underline {T}_t\) be the associated family of lower transition operators. Then, for all \(f\in \mathcal {L}(\mathcal {X})\) , all \(t\in \mathbb {N}_{0}\) and all \(x\in \mathcal {X}\) , it holds that

$$\displaystyle \begin{aligned} \bigl[\underline{T}_tf\bigr](x) = \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_{t+1})\,\vert\,X_t=x\bigr]\,. \end{aligned}$$

Proof

Simply use the definitions together with Proposition 5.3:

$$\displaystyle \begin{aligned} \bigl[\underline{T}_tf\bigr](x) {=} \inf_{T_t\in\mathcal{T}_t} \bigl[T_tf\bigr](x) &{=} \inf_{T_t\in\mathcal{T}_t} \sum_{y\in\mathcal{X}} f(y) T_t(x,y) \\ &{=} \inf_{P(X_{t{+}1}\vert X_t)\in\mathcal{P}(X_{t{+}1}\vert X_t)} \sum_{y\in\mathcal{X}} f(y) P(X_{t{+}1}{=}y\vert X_t=x) \\ &{=} \inf_{P\in\mathbb{P}} \sum_{y\in\mathcal{X}} f(y) P(X_{t{+}1}{=}y\vert X_t{=}x) \\ &{=} \inf_{P\in\mathbb{P}} \mathbb{E}_P\bigl[f(X_{t{+}1})\,\big\vert\,X_t{=}x\bigr] {=} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_{t{+}1})\,\big\vert\,X_t{=}x\bigr]\,, \end{aligned} $$

where in the fourth equality, we used the definition of the compatible measures. □

As in Corollary 5.3, we can now state the simplified law of iterated lower expectation for imprecise Markov chains, using these lower transition operators:

Theorem 5.3

Let \(\mathbb {P}\) be an IDTMC that is separately specified, and let \( \underline {T}_t\) be the associated family of lower transition operators. Then, for all \(f\in \mathcal {L}(\mathcal {X})\) , all \(s,t\in \mathbb {N}_{0}\) such that \(s<t\) and all \(x\in \mathcal {X}\) , it holds that

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_t)\,\big\vert\,X_s=x\bigr] = \bigl[\underline{T}_s\underline{T}_{s+1}\cdots \underline{T}_{t-1}f\bigr](x)\,, \end{aligned}$$

where the right-hand side represents an iterated operator product (composition).

We omit the full proof, but the interested reader can reconstruct the argument by using the general computational process of iterated lower expectation as explained in Sect. 5.3.1, the imprecise Markov property from Definition 5.10 and the interpretation of the lower transition operator from Proposition 5.5.
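To illustrate how Definition 5.14 and Theorem 5.3 translate into computations, here is a minimal Python sketch for a homogeneous IDTMC on two states. The two transition matrices are hypothetical; because the componentwise minimum below optimises each row independently, it implicitly recombines rows across matrices, which is exactly what the assumption of separately specified rows licenses.

```python
import numpy as np

# A hypothetical set of transition matrices with separately specified rows.
T_set = [
    np.array([[0.7, 0.3], [0.2, 0.8]]),
    np.array([[0.5, 0.5], [0.4, 0.6]]),
]

def lower_T(f):
    """Lower transition operator: [Tf](x) = inf over the matrix set,
    computed componentwise (i.e. row by row)."""
    return np.min([T @ f for T in T_set], axis=0)

def lower_exp(f, steps):
    """Theorem 5.3 under homogeneity: the lower expectation of f(X_t)
    given X_s = x is the (t - s)-fold application of lower_T to f."""
    for _ in range(steps):
        f = lower_T(f)
    return f

f = np.array([1.0, 0.0])  # indicator of the first state
print(lower_exp(f, 3))    # one value for each conditioning state x
```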

5.3.3 Limits of Homogeneous IDTMCs

We conclude the discussion of imprecise discrete-time Markov chains with some results about their limit behaviour, in analogy to the results in Sect. 5.2.3. We start again by restricting attention to homogeneous IDTMCs, and notice the following (we omit the proof, which is straightforward):

Proposition 5.6

Let \(\mathbb {P}\) be a homogeneous IDTMC, and let \( \underline {T}_t\) be the associated family of lower transition operators. Then there is a unique lower transition operator \( \underline {T}:\mathcal {L}(\mathcal {X})\to \mathcal {L}(\mathcal {X})\) , such that, for all \(f\in \mathcal {L}(\mathcal {X})\), \( \underline {T}_tf= \underline {T}f\) for all \(t\in \mathbb {N}_{0}\).

We take a moment here to remark on a property that was already encountered in Chap. 2: the duality between lower expectation operators and closed and convex sets of probability measures. Indeed, this correspondence was also used in Definition 5.14 above, where we used the sets \(\mathcal {T}_t\) of transition matrices, to construct the lower transition operator \( \underline {T}_t\). Since, as we have just seen, the dynamics of a homogeneous IDTMC can be completely described by a single \( \underline {T}\), it now makes sense to think about the other direction.

Specifically, corresponding to \( \underline {T}\), there exists a closed and convex set \(\mathcal {T}\) of transition matrices, such that \( \underline {T}f=\inf _{T\in \mathcal {T}}Tf\) for all \(f\in \mathcal {L}(\mathcal {X})\). This implies that (up to the initial distribution at time zero) an IDTMC can also be characterised by such a set \(\mathcal {T}\). So, whereas we noted in Sect. 5.2.2 that a (precise) discrete-time Markov chain’s canonical parameter is a single transition matrix T, for a homogeneous IDTMC, the parameter can be understood as a single closed and convex set \(\mathcal {T}\) of transition matrices. Moreover, if in this parametrisation we ensure that \(\mathcal {T}\) has separately specified rows—essentially, satisfies a property exactly analogous to Eq. (5.4)—then the corresponding IDTMC will also be separately specified.

Furthermore, in Sect. 5.2.2 we used a property of the associated transition matrix T to state a sufficient condition for the long-term behaviour of the Markov chain to converge to a distribution over the states, independently of the state in which it started. We here have a similar result, which starts by introducing the conjugate upper transition operator \(\overline {T}:\mathcal {L}(\mathcal {X})\to \mathcal {L}(\mathcal {X}):f\mapsto - \underline {T}(-f)\).

Now, recall that in the precise case, a homogeneous discrete-time Markov chain with transition matrix T was said to be regular, if there was some \(n\in \mathbb {N}\) such that \(T^n(x,y)>0\) for all \(x,y\in \mathcal {X}\). The interpretation is clear: the Markov chain is regular if and only if there is some finite number of steps n in which every state x can reach every state y with positive probability. This is now generalised to the imprecise case:

Definition 5.15 (Regularity for homogeneous IDTMC)

Let \(\mathbb {P}\) be a homogeneous IDTMC with associated lower (and upper) transition operator \( \underline {T}\) (and \(\overline {T}\)). Then the IDTMC is regular if there is some \(n\in \mathbb {N}\) such that \(\bigl [\overline {T}^n\mathbb {I}_y\bigr ](x)>0\) for all \(x,y\in \mathcal {X}\).

Let us consider this definition. One difference with the precise case is the introduction of the indicator function \(\mathbb {I}_y\) on the state \(y\in \mathcal {X}\); this was introduced because, in contrast to matrices, we cannot index the ‘elements’ of the transition operator. Specifically, using Theorem 5.3, we can interpret the condition as

$$\displaystyle \begin{aligned} 0 < \Bigl[\overline{T}^n\mathbb{I}_y\Bigr](x) = \overline{\mathbb{E}}_{\mathbb{P}}\bigl[\mathbb{I}_y(X_n)\,\big\vert\,X_0=x\bigr] = \sup_{P\in\mathbb{P}} P(X_n=y\,\vert\,X_0=x)\,, \end{aligned}$$

for all \(x,y\in \mathcal {X}\) and some \(n\in \mathbb {N}\). What regularity asks for, then, is for there to be some \(n\in \mathbb {N}\) such that it is possible for all \(x,y\in \mathcal {X}\) to move from x to y in exactly n steps, according to some \(P\in \mathbb {P}\). In particular, the (precise) measure P for which this needs to be possible can be different for every pair \(x,y\in \mathcal {X}\). Regularity for IDTMCs is therefore in a sense a much weaker—easier to satisfy—condition than that for precise Markov chains. Nevertheless, the condition is sufficient for the following:

Theorem 5.4

Let \(\mathbb {P}\) be a homogeneous IDTMC that is separately specified and regular, with associated lower transition operator \( \underline {T}\) . Then, there is a unique lower expectation operator \( \underline {\mathbb {E}}_{\mathbb {P}}[\cdot (X_{+\infty })]:\mathcal {L}(\mathcal {X})\to \mathbb {R}\) such that, for all \(f\in \mathcal {L}(\mathcal {X})\) and all \(x\in \mathcal {X}\),

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_{+\infty})\bigr] = \lim_{n\to+\infty} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_n)\,\vert\,X_0=x\bigr] = \lim_{n\to+\infty} \bigl[\underline{T}^nf\bigr](x)\,. \end{aligned}$$

Furthermore, this is the unique \( \underline {T}\) -invariant lower expectation on \(\mathcal {L}(\mathcal {X})\) , meaning that \( \underline {\mathbb {E}}_{\mathbb {P}}[f(X_{+\infty })]= \underline {\mathbb {E}}_{\mathbb {P}}\bigl [[ \underline {T}\,f](X_{+\infty })\bigr ]\) for all \(f\in \mathcal {L}(\mathcal {X})\).
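Continuing the hypothetical two-state example from the previous sketch, the following code first verifies regularity through the conjugate upper transition operator and then iterates the lower transition operator; as Theorem 5.4 predicts, the iterates become (approximately) constant in the initial state x.

```python
import numpy as np

T_set = [
    np.array([[0.7, 0.3], [0.2, 0.8]]),
    np.array([[0.5, 0.5], [0.4, 0.6]]),
]

def lower_T(f):
    return np.min([T @ f for T in T_set], axis=0)

def upper_T(f):
    return -lower_T(-f)  # conjugacy between lower and upper operators

# regularity (Definition 5.15): here already for n = 1, since
# [upper_T 1_y](x) > 0 for every pair of states x, y
for y in range(2):
    print(upper_T(np.eye(2)[y]))  # all entries strictly positive

# Theorem 5.4: iterate lower_T; the limit no longer depends on x
f = np.array([1.0, 0.0])
for _ in range(100):
    f = lower_T(f)
print(f)  # both components (approximately) equal
```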

5.4 Imprecise Continuous-Time Markov Chains

We now move on to the discussion about (imprecise) continuous-time Markov chains. We have already encountered this setting several times in the preceding discussions but have generally skipped over any details. Let us recall from Sect. 5.2 that continuous-time stochastic processes are identified with a time-dimension \(\mathbb {T}=\mathbb {R}_{\geq 0}\) and that the elements ω of the outcome space of paths Ω are maps \(\omega :\mathbb {R}_{\geq 0}\to \mathcal {X}\). The measure-theoretic definition is then as before, where we consider the abstract probability space \((\varOmega ,\mathcal {F},P)\), and the stochastic process \(\{X_t\}_{t\in \mathbb {R}_{\geq 0}}\) is a family of random variables on this space. Furthermore, measure-theoretic definitions of (homogeneous) continuous-time Markov chains (CTMCs) have already been encountered in Definitions 5.5 and 5.6.

How, then, can these models be interpreted? Let us start by considering the simplest case, viz., a precise and homogeneous Markov chain in continuous-time. According to the previous definitions, this is a stochastic process such that

  1. \(P(X_t\,\vert \, X_{s_1},\ldots ,X_{s_n}) = P(X_t\,\vert \,X_{s_n})\) for all \(s_1<\cdots <s_n<t\) in \(\mathbb {R}_{\geq 0}\), and

  2. \(P(X_t\,\vert \,X_s) = P(X_{t-s}\,\vert \,X_0)\) for all \(s<t\) in \(\mathbb {R}_{\geq 0}\).

The immediate difficulty of moving on from this abstract representation is that the time-dimension is now, in a sense, too big to use any of the previous representations. For instance, we could try to draw a ‘continuous-time’ probability tree, where the local model of a situation with terminal state \(w_\top \) is given by a probability mass function \(P(X_t\,\vert \,X_0=w_\top )\). But what is the time t that we should use? When we were working in discrete-time, the approach was to use the next time point, as viewed from the current situation. But of course, there is no ‘next’ time t when working in continuous-time! This difficulty of using graphical representations is the main reason that we have postponed the treatment of continuous-time processes until now, thereby hopefully allowing the reader to first develop some graphical intuition for the discrete-time case.

Nevertheless, all is not lost; the first interpretation that we will consider is to view continuous-time processes as limits of discrete-time ones. To this end, it will be convenient to consider the transition matrix T associated with a homogeneous DTMC. Let us recall from Sects. 5.2.2 and 5.2.3 that the elements of such a matrix represent the ‘transition probabilities’ of the system, that is, the probability of moving from a state x to a state y, in one time step:

$$\displaystyle \begin{aligned} T(x,y) = P(X_1=y\,\vert\,X_0=x)\,. \end{aligned}$$

We can use this formalism to interpret the continuous-time case, by simply ‘fixing the length of the step’. That is, consider some ‘step size’ Δ > 0. Then, for a homogeneous CTMC, we know that

$$\displaystyle \begin{aligned} P(X_{t+\varDelta}\,\vert\,X_t) = P(X_\varDelta\,\vert\,X_0)\,, \end{aligned}$$

for all \(t\in \mathbb {R}_{\geq 0}\), so we can collect these ‘transition probabilities’ in a matrix \(T_\varDelta \):

$$\displaystyle \begin{aligned} T_\varDelta(x,y) :\!= P(X_\varDelta=y\,\vert\,X_0=x)\quad \quad \text{for all }x,y\in\mathcal{X}\,. \end{aligned}$$

Clearly, the elements of \(T_\varDelta \) are the probabilities for the system to end up in a state y, if it is currently in a state x, after a time duration of Δ has elapsed. Provided, then, that we are not interested in a granularity of the time-dimension that is finer than Δ, this representation suffices. The matrix \(T_\varDelta \) can be associated with a DTMC, and all the previous results can be used. For instance, for any multiple nΔ of Δ, with \(n\in \mathbb {N}\), we use Proposition 5.2 to find that

$$\displaystyle \begin{aligned} P(X_{n\varDelta}=y\,\vert\,X_0=x) = T_\varDelta^n(x,y)\,. \end{aligned}$$

But, of course, the point of using the continuous-time representation is that we are interested in an arbitrarily fine granularity of the time-dimension. In particular, the measure-theoretic definition encodes this arbitrary granularity, and it seems a waste to only focus on the restriction to a single step size Δ. The ‘trick’, then, is to take the limit as Δ goes to zero, and somehow usefully represent this limit. It is hopefully clear from the above discussion that, as we decrease Δ further and further, the associated transition matrix \(T_\varDelta \) covers increasingly smaller steps along the time-dimension. And, for each such positive Δ, we can associate a discrete-time Markov chain and use all the previous interpretations that we developed.

We first remark that the naive limit does not encode a lot of information; ignoring possible issues of continuity, it trivially holds that

$$\displaystyle \begin{aligned} \lim_{\varDelta\to 0^+} P(X_{\varDelta}=y\,\vert\,X_0=x) = P(X_0=y\,\vert\,X_0=x) = \begin{cases} 1 & \text{if }y=x\text{, and} \\ 0 & \text{otherwise.} \end{cases} \end{aligned} $$
(5.5)

In matrix notation this reads as \(\lim _{\varDelta \to 0^+}T_\varDelta = I\), where I denotes the \(\lvert \mathcal {X}\rvert \times \lvert \mathcal {X}\rvert \) identity matrix. Colloquially, we might understand this as saying that ‘if time does not evolve, the system does not change’. This is clearly an almost tautological statement to make of what may be interpreted as a dynamical system. So let us consider how the system does change as time evolves. The natural representation for this is obviously the derivative of the transition matrix \(T_\varDelta \); this is the limit interpretation that we shall use. Ignoring technical issues of differentiability, we have

$$\displaystyle \begin{aligned} \frac{\mathrm{d}\,T_\varDelta}{\mathrm{d}\,\varDelta}\bigg\vert_{\varDelta=0} = \lim_{\varDelta\to0^+}\frac{T_\varDelta - I}{\varDelta} =: Q\,, \end{aligned} $$
(5.6)

where we have used the previous observation that \(T_0=I\). On the right-hand side, the term Q is called the transition rate matrix of the homogeneous CTMC (or sometimes simply the rate matrix). It is clear from the above definition that it encodes the rate of change of the transition probabilities around time zero. It satisfies the following properties:

Definition 5.16 (Transition Rate Matrix)

A real-valued \(\lvert \mathcal {X}\rvert \times \lvert \mathcal {X}\rvert \) matrix Q is called a transition rate matrix if, for all \(x\in \mathcal {X}\), it holds that

  1. \(Q(x,y)\geq 0\) for all \(y\in \mathcal {X}\) such that \(x\neq y\), and

  2. \(\sum _{y\in \mathcal {X}}Q(x,y)=0\).

The elements \(Q(x,y)\) of a rate matrix can be interpreted as the ‘speed’ with which the process moves from the state x to the state y. In the above definition, the two conditions together imply that the diagonal elements \(Q(x,x)\) are always non-positive. On the other hand, the first condition states that the off-diagonal elements are non-negative. Combined, this can be understood as saying that the system will move ‘out’ of the current state (the non-positivity of the diagonal elements) and ‘into’ some other states (the non-negativity of the off-diagonals).

A more concrete way to interpret the rate matrix is through a linearised approximation of the transition probabilities over a small enough time step. That is, it follows from Eq. (5.6) that, for ‘small enough’ Δ > 0, it holds that \(Q\approx (T_\varDelta -I)/\varDelta \); hence also

$$\displaystyle \begin{aligned} T_\varDelta \approx I + \varDelta Q\,. \end{aligned} $$
(5.7)

We therefore see that the matrix Q can be used to approximately compute the transition probabilities over a small enough time step.

An obvious next question is whether we can extrapolate this to compute the matrix \(T_t\) that contains the transition probabilities over an arbitrary duration t. Indeed we can, although it requires a bit of setup. For any \(t\in \mathbb {R}_{\geq 0}\), first define the transition matrix of the CTMC after time t:

$$\displaystyle \begin{aligned} T_t(x,y) :\!= P(X_{t}=y\,\vert\,X_0=x)\quad \quad \text{for all }x,y\in\mathcal{X}\,. \end{aligned}$$

Then we differentiate in t; to this end, first fix Δ > 0, and use the Markov property and homogeneity to derive that \(T_{t+\varDelta }=T_tT_\varDelta =T_\varDelta T_t\) (cf. Proposition 5.2). Then we proceed by using Eq. (5.6):

$$\displaystyle \begin{aligned} \frac{\mathrm{d}\,T_t}{\mathrm{d}\,t} = \lim_{\varDelta\to0^+} \frac{T_{t+\varDelta} - T_t}{\varDelta} = \lim_{\varDelta\to0^+} \frac{T_\varDelta T_t - T_t}{\varDelta} = \left(\lim_{\varDelta\to0^+} \frac{T_\varDelta - I}{\varDelta}\right)T_t = QT_t\,. \end{aligned}$$

Using also Eq. (5.5), we can now write the matrix differential equation

$$\displaystyle \begin{aligned} \frac{\mathrm{d}\,T_t}{\mathrm{d}\,t} = QT_t\,,\quad \quad T_0 = I\,, \end{aligned}$$

whose solution is the matrix exponential of Qt:

$$\displaystyle \begin{aligned} T_t = e^{Qt}\,. \end{aligned}$$

We recall from Proposition 5.4 that the dynamic behaviour of a homogeneous discrete-time Markov chain can be characterised by a single transition matrix T and that therefore this matrix constitutes the canonical parameter of the process. Because the matrix Q can be used to (re-)construct the transition matrices of a homogeneous CTMC over any time duration, it plays the same role here.

Proposition 5.7

Let \(\{X_t\}_{t\in \mathbb {R}_{\geq 0}}\) be a continuous-time homogeneous Markov chain, with transition rate matrix Q as defined above. Then for all \(t\in \mathbb {R}_{\geq 0}\) , the transition probabilities \(P(X_t=y\,\vert \,X_0=x)\) , \(x,y\in \mathcal {X}\) , after time t are given by the elements \(T_t(x,y)\) of the transition matrix \(T_t=e^{Qt}\) .

While we do not aim to give a complete treatment on the interpretation of the matrix exponential, some properties are worth pointing out. First of all, it can be defined analogously to the exponential function of real numbers, that is, through a Taylor expansion around zero. Specifically, it holds that

$$\displaystyle \begin{aligned} T_t = e^{Qt} :\!= \sum_{k=0}^{+\infty} \frac{t^kQ^k}{k!}\,. \end{aligned}$$

Thus, the approximation in Eq. (5.7) can be seen as a first-order truncation of the series above.

As a second important point, we can consider the entire family of transition matrices \(T_t\) for all \(t\in \mathbb {R}_{\geq 0}\). Then this family constitutes a semi-group of transition matrices, and Q is the generator of this semi-group. Specifically, it holds that \(T_{t+s}=T_tT_s\) for all \(t,s\in \mathbb {R}_{\geq 0}\)—this is called the semi-group property. Observe that it is analogous to the result in Proposition 5.2 and that we already used this property for the matrix \(T_{t+\varDelta }\) when constructing the derivative.

These properties immediately yield a different representation for the matrix exponential, which will be convenient further on. We omit the proof.

Proposition 5.8

Let \(\{X_t\}_{t\in \mathbb {R}_{\geq 0}}\) be a continuous-time homogeneous Markov chain, with transition rate matrix Q, and let T t be the associated family of transition matrices. Then, for all \(t\in \mathbb {R}_{\geq 0}\) , it holds that

$$\displaystyle \begin{aligned} T_t = \lim_{n\to+\infty} \left(I+\frac{t}{n}Q\right)^n\,. \end{aligned}$$

One way to think about this is that, for some fixed (but large enough) \(n\in \mathbb {N}\), each factor \((I+\frac {t}{n}Q)\) is, due to Eq. (5.7), roughly the ‘small step’ transition matrix \(T_{t/n}\). The multiplication of these n factors in \((I+\frac {t}{n}Q)^n\) is then analogous to the composition in Proposition 5.2, whereby we cover the duration t in steps of size t∕n. It should be noted that this only becomes exact in the limit (as the result states), but the intuition behind it is the same regardless.
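As a quick numerical check of Propositions 5.7 and 5.8, the following sketch compares the matrix exponential with the limit product for a hypothetical two-state rate matrix; we assume scipy is available for the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

# A hypothetical rate matrix: off-diagonal elements non-negative and
# rows summing to zero, as required by Definition 5.16.
Q = np.array([[-2.0,  2.0],
              [ 1.0, -1.0]])
t = 0.5

T_t = expm(Q * t)  # Proposition 5.7: the transition matrix after time t

# Proposition 5.8: (I + (t/n) Q)^n approaches T_t as n grows
n = 100_000
approx = np.linalg.matrix_power(np.eye(2) + (t / n) * Q, n)
print(np.max(np.abs(T_t - approx)))  # a small discrepancy, vanishing in n
```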

Furthermore, let us again remark that the transition-matrix representation is also convenient in that it offers an alternative representation of the conditional expectation operator:

Proposition 5.9

Let \(\{X_t\}_{t\in \mathbb {R}_{\geq 0}}\) be a continuous-time homogeneous Markov chain, with transition rate matrix Q, and let T t be the associated family of transition matrices. Then, for all \(f\in \mathcal {L}(\mathcal {X})\) , all \(t\in \mathbb {R}_{\geq 0}\) and all \(x\in \mathcal {X}\) , it holds that \(\mathbb {E}[f(X_t)\,\vert \,X_0=x]=\bigl [T_tf\bigr ](x)\) .

Proof

Analogous to the proof of Proposition 5.3. □

Let us consider the importance of the homogeneity assumption in the preceding exposition. Indeed, it is this property that crucially allows the parametrisation to only require a single rate matrix Q. More generally, we may consider a non-homogeneous CTMC and consider the derivatives at each time point; first write the transition matrix for the interval [s, t] as

$$\displaystyle \begin{aligned} T_s^t(x,y) :\!= P(X_t=y\,\vert\,X_s=x)\,, \end{aligned}$$

and differentiate to obtain

$$\displaystyle \begin{aligned} \frac{\mathrm{d}\,T_s^t}{\mathrm{d}\,t}\bigg\vert_{t=s} = \lim_{t\to s^+}\frac{T_s^t-I}{t-s} =: Q_s\,, \end{aligned}$$

whence the parametrisation now requires an entire family \(Q_s\) of rate matrices—one for each point in time. Note, though, that these matrices are still transition rate matrices, in that they satisfy the properties in Definition 5.16. However, the corresponding matrix differential equation is no longer solved by a simple matrix exponential.

More generally still, for arbitrary continuous-time stochastic processes (that are neither homogeneous nor Markov) we may consider the transition rates (derivatives) not only for specific points in time but also for specific histories leading up to that time. For instance, with \(\mathbf {s}=s_1,\ldots ,s_n\) and t in \(\mathbb {R}_{\geq 0}\) such that \(s_1<\cdots <s_n<t\), and \(x_{\mathbf {s}}\in \mathcal {X}^n\), we may write

$$\displaystyle \begin{aligned} \frac{\mathrm{d}}{\mathrm{d}\,u}P\bigl(X_u=y\,\vert\,X_{\mathbf{s}}=x_{\mathbf{s}},X_t=x\bigr)\bigg\vert_{u=t} =: Q_{x_{\mathbf{s}},t}(x,y)\,. \end{aligned} $$
(5.8)

Thus, the parametrisation requires the specification of a transition rate matrix for each point in time and for each possible history before that time. It should be clear that this leads to a rather unwieldy process specification, which again goes some way in illustrating why homogeneity and Markovianity are such popular simplifying assumptions.

5.4.1 Imprecise Continuous-Time Markov Chains

With the notation and concepts for precise continuous-time stochastic processes in place, let us now turn to the imprecise generalisation. In what follows, we will consider imprecise, homogeneous continuous-time Markov chains (ICTMC). As before, we start by considering the abstract sets-of-measures definition:

Definition 5.17 (ICTMC as set of processes)

An imprecise continuous-time Markov chain is a set \(\mathbb {P}\) of probability measures on the measurable space \((\varOmega ,\mathcal {F})\) of (continuous-time) paths, with associated lower expectation operator \( \underline {\mathbb {E}}_{\mathbb {P}}\) such that, for all \(f\in \mathcal {L}(\mathcal {X})\) and all \(s_1,\ldots ,s_n,t\in \mathbb {R}_{\geq 0}\) such that s 1 < ⋯ < s n < t, it holds that

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_t)\,\big\vert\,X_{s_1},\ldots,X_{s_n}\bigr] = \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_t)\,\big\vert\,X_{s_n}\bigr]\,. \end{aligned}$$

Furthermore, an imprecise continuous-time Markov chain is called homogeneous if, for all \(s,t\in \mathbb {R}_{\geq 0}\), s < t, and all \(f\in \mathcal {L}(\mathcal {X})\), it holds that \( \underline {\mathbb {E}}_{\mathbb {P}}\bigl [f(X_t)\,\big \vert \,X_s\bigr ]= \underline {\mathbb {E}}_{\mathbb {P}}\bigl [f(X_{t-s})\,\big \vert \,X_0\bigr ]\).

As in the discussion about imprecise discrete-time Markov chains, we distinguish between the definition by epistemic irrelevance—which is what is used above—and the definition by strong independence, which would imply that all \(P\in \mathbb {P}\) are precise (homogeneous) Markov chains, and which we are explicitly not using.

Let us now consider the parametrisation of such an ICTMC. We recall that in the precise case, the canonical parameter is a single transition rate matrix Q. In contrast, for the imprecise case, the ‘parameter’ of interest is a set \(\mathcal {Q}\) of transition rate matrices. Because a precise homogeneous CTMC is identified with a rate matrix Q, it is clear that such a set \(\mathcal {Q}\) induces a set of precise processes: simply consider all processes for which the associated rate matrix is included in \(\mathcal {Q}\). However, this induced set then only includes homogeneous Markov processes, and, as remarked above, we aim to relax these independence assumptions. Using the parametrisation of more general precise processes, we introduce the notion of compatibility with a given set of rate matrices:

Definition 5.18

Let \(\mathcal {Q}\) be a set of transition rate matrices. Then a continuous-time stochastic process P is called compatible with \(\mathcal {Q}\) if, for all \(\mathbf {s}=s_1,\ldots ,s_n\) and \(t\in \mathbb {R}_{\geq 0}\) such that \(s_1<\cdots <s_n<t\), and all \(x_{\mathbf {s}}\in \mathcal {X}^n\), it holds that \(Q_{x_{\mathbf {s}},t}\in \mathcal {Q}\), where \(Q_{x_{\mathbf {s}},t}\) is the time- and history-dependent rate matrix associated with P, as in Eq. (5.8).

It can be verified that this definition includes, as a special case, the compatibility of homogeneous CTMCs: a homogeneous CTMC with rate matrix Q is compatible with a given set \(\mathcal {Q}\) if \(Q\in \mathcal {Q}\). Similarly, a non-homogeneous CTMC that is parametrised by a family \(Q_t\) is compatible with such a set if \(Q_t\in \mathcal {Q}\) for all \(t\in \mathbb {R}_{\geq 0}\). The ICTMC \(\mathbb {P}\) corresponding to a given set \(\mathcal {Q}\), then, is taken to be the largest set of continuous-time stochastic processes that are compatible with this \(\mathcal {Q}\). While perhaps not obvious, it can be proven that this set \(\mathbb {P}\) is then indeed a homogeneous ICTMC, in the sense that its corresponding lower expectations satisfy the properties of Definition 5.17.

With this ICTMC in place, let us now again consider the main inferential challenge: how to compute the corresponding lower expectation. A first attempt could be to use Propositions 5.7 and 5.9 and optimise over \(\mathcal {Q}\); for some fixed \(f\in \mathcal {L}(\mathcal {X})\), this would give

$$\displaystyle \begin{aligned} \inf_{Q\in\mathcal{Q}} e^{Qt}f\,. \end{aligned}$$

If we think about what this computes, we come to the conclusion that for each \(Q\in \mathcal {Q}\), there is a homogeneous CTMC for which the conditional expectation of f at time \(t\in \mathbb {R}_{\geq 0}\) is indeed e Qt f. We therefore conclude that this computes the lower expectation with respect to all homogeneous CTMCs that are compatible with \(\mathcal {Q}\). But what about the non-homogeneous and/or non-Markovian stochastic processes that we know are also included in \(\mathbb {P}\)? It turns out that the above expression ignores their corresponding expectations and hence only yields an upper bound on the actual lower expectation. In other words, we cannot use this expression to compute the lower expectation for \(\mathbb {P}\).

The way to proceed is analogous to the approach in Sect. 5.3.2; we first define a local ‘lower’ operator and then find the global lower expectation using repeated compositions of this operator through the law of iterated lower expectation. To this end, we associate with the set \(\mathcal {Q}\) the corresponding lower transition rate operator \( \underline {Q}:\mathcal {L}(\mathcal {X})\to \mathcal {L}(\mathcal {X})\), which is defined for all \(f\in \mathcal {L}(\mathcal {X})\) and all \(x\in \mathcal {X}\) as

$$\displaystyle \begin{aligned} \bigl[\underline{Q}f\bigr](x) :\!= \inf_{Q\in\mathcal{Q}} \bigl[Qf\bigr](x)\,. \end{aligned} $$
(5.9)

Intuitively, for small Δ > 0, we can then approximate the lower expectation as

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_\varDelta)\,\vert\,X_0\bigr] \approx \inf_{Q\in\mathcal{Q}} (I+\varDelta Q)f = (I+\varDelta\underline{Q})f\,, \end{aligned}$$

where the approximation is again due to Eq. (5.7). It turns out that we can make this exact and extend the result to any time t, analogously to Proposition 5.8:

Theorem 5.5

Let \(\mathcal {Q}\) be a non-empty set of transition rate matrices, and let \( \underline {Q}\) be the corresponding lower transition rate operator, as in Eq. (5.9). Then, for all \(t\in \mathbb {R}_{\geq 0}\) , there is an operator \( \underline {T}_t:\mathcal {L}(\mathcal {X})\to \mathcal {L}(\mathcal {X})\) , such that

$$\displaystyle \begin{aligned} \underline{T}_t = \lim_{n\to+\infty}\left(I+\frac{t}{n}\underline{Q}\right)^n\,. \end{aligned}$$

These operators satisfy \( \underline {T}_0=I\), \( \underline {T}_{t+s}= \underline {T}_t \underline {T}_s\) for all \(t,s\in \mathbb {R}_{\geq 0}\) and \({\mathrm {d}}/{\mathrm {d}t} \underline {T}_t= \underline {Q} \underline {T}_t\).

Observe that this family of operators \( \underline {T}_t\) satisfies in large part the same properties as the matrix exponentials of Qt—c.f. the discussion after Proposition 5.7—with the main difference being that they are non-linear operators. We can now finally present the result that allows the computation of lower expectations for ICTMCs.
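The following Python sketch illustrates Theorems 5.5 and 5.6 on a hypothetical two-element set of rate matrices: the lower transition rate operator is computed row by row, and the n-fold composition \((I+\frac {t}{n} \underline {Q})^nf\) approximates the lower expectation. For comparison, the last line optimises the matrix exponential over the set directly, which accounts only for the homogeneous Markov elements of \(\mathbb {P}\) and therefore yields values at least as high, as discussed above.

```python
import numpy as np
from scipy.linalg import expm

# A hypothetical set of rate matrices with separately specified rows.
Q_set = [
    np.array([[-2.0,  2.0], [3.0, -3.0]]),
    np.array([[-1.0,  1.0], [1.0, -1.0]]),
]

def lower_Q(f):
    """Lower transition rate operator of Eq. (5.9), by enumeration."""
    return np.min([Q @ f for Q in Q_set], axis=0)

def lower_T(f, t, n=100_000):
    """Theorem 5.5: apply (I + (t/n) lower_Q) to f, n times over."""
    for _ in range(n):
        f = f + (t / n) * lower_Q(f)
    return f

f = np.array([1.0, 0.0])  # indicator of the first state
t = 0.5
print(lower_T(f, t))  # approx. [0.48, 0.26]: lower expectation of f(X_t)

# optimising over the homogeneous compatible CTMCs only: higher values here
print(np.min([expm(Q * t) @ f for Q in Q_set], axis=0))  # approx. [0.63, 0.32]
```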

Theorem 5.6

Let \(\mathcal {Q}\) be a non-empty set of transition rate matrices, with corresponding lower transition rate operator \( \underline {Q}\) , and let \(\mathbb {P}\) be the corresponding ICTMC. Suppose that \(\mathcal {Q}\) is closed, convex and has separately specified rows (i.e. is closed under recombination of the rows of its elements). Then, for all \(f\in \mathcal {L}(\mathcal {X})\) , all \(t\in \mathbb {R}_{\geq 0}\) and all \(x\in \mathcal {X}\) , it holds that

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_t)\,\vert\,X_0=x\bigr] = \bigl[\underline{T}_tf\bigr](x)\,. \end{aligned} $$
(5.10)

Observe that this result needs some constraints on the rate matrix set \(\mathcal {Q}\). The reason is that the right-hand side of Eq. (5.10) depends, through Theorem 5.5, on the lower transition rate operator \( \underline {Q}\). In turn, \( \underline {Q}\) depends on \(\mathcal {Q}\) through Eq. (5.9). Conversely, the left-hand side (the lower expectation) depends on the set \(\mathbb {P}\), which in turn depends on \(\mathcal {Q}\) through the compatibility as in Definition 5.18. It turns out that for these different dependencies on \(\mathcal {Q}\) to be equivalent, we need some regularity conditions on this latter set—these are the constraints mentioned in the theorem above.

5.4.2 Limits of ICTMCs

Let us finally consider the long-term behaviour of a given homogeneous ICTMC \(\mathbb {P}\) with transition rate matrix set \(\mathcal {Q}\) and associated lower transition rate operator \( \underline {Q}\); we assume these to be fixed in the remainder of this section. What, then, can we say about the lower expectation of a function as time goes to infinity?

Recall that, in the discrete-time case, Theorem 5.4 established a sufficient condition for such a lower expectation to converge. This condition was regularity of the IDTMC. Essentially, this meant that it was possible for the IDTMC to move from any state to any other state, in exactly n steps, for some \(n\in \mathbb {N}\). In the continuous-time case that we consider here, there is a similar condition: upper reachability between all pairs of states.

We first remark that this condition is defined using the conjugate upper transition rate operator defined as \(\overline {Q}f:\!= - \underline {Q}(-f)\) for all \(f\in \mathcal {L}(\mathcal {X})\). The definition of upper reachability is then analogous to that of accessibility in discrete-time but is instead defined using the transition rates, rather than probabilities:

Definition 5.19

Let \(\mathbb {P}\) be an ICTMC with associated upper transition rate operator \(\overline {Q}\), as defined above. For any two states \(x,y\in \mathcal {X}\), y is said to be upper reachable from x, if there is a sequence \(x_0,\ldots ,x_n\in \mathcal {X}\), \(n\in \mathbb {N}\), such that x 0 = x, x n = y and, for all i ∈{1, …, n}, it holds that x i ≠ x i−1 and \(\bigl [\overline {Q}\,\mathbb {I}_{x_i}\bigr ](x_{i-1})>0\).

Let us in particular consider the final condition in this definition. From the conjugacy between the lower and upper transition rate operators, and the definition of the former, we can rewrite this requirement as saying that

$$\displaystyle \begin{aligned} 0 < \bigl[\overline{Q}\,\mathbb{I}_{x_i}\bigr](x_{i-1}) = \sup_{Q\in\mathcal{Q}} \bigl[Q\,\mathbb{I}_{x_i}\bigr](x_{i-1}) = \sup_{Q\in\mathcal{Q}} Q(x_{i-1},x_i)\,. \end{aligned}$$

Thus, upper reachability of y, from x, requires that there exists a sequence of states from x to y such that, at each step in this sequence, there is some transition rate matrix \(Q\in \mathcal {Q}\) which assigns a strictly positive ‘speed’ to moving from the current state in this sequence to the next one. In other words, it should be possible for these transitions to happen according to some of the models in our set \(\mathbb {P}\), but not necessarily all, and there can be a different model allowing for this possibility at each step. This can now be used to state the following result:
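To make Definition 5.19 operational, the sketch below checks upper reachability for a hypothetical three-state example, by a breadth-first search over the directed graph that has an edge from u to v whenever some rate matrix in the set assigns a strictly positive rate to that transition; we treat each state as trivially reachable from itself.

```python
import numpy as np
from collections import deque

# A hypothetical set of two rate matrices on a three-state space.
Q_set = [
    np.array([[-2.0, 2.0, 0.0], [1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]),
    np.array([[-1.0, 0.0, 1.0], [0.0,  0.0, 0.0], [0.0, 2.0, -2.0]]),
]
m = 3

def edge(u, v):
    """True when sup over the rate-matrix set of Q(u, v) is positive."""
    return u != v and max(Q[u, v] for Q in Q_set) > 0

def upper_reachable(x, y):
    """Breadth-first search along the edges defined above."""
    seen, queue = {x}, deque([x])
    while queue:
        u = queue.popleft()
        if u == y:
            return True
        for v in range(m):
            if v not in seen and edge(u, v):
                seen.add(v)
                queue.append(v)
    return False

# the sufficient condition of Theorem 5.7: every state is upper reachable
# from every other state
print(all(upper_reachable(x, y) for x in range(m) for y in range(m)))  # True
```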

Theorem 5.7

Let \(\mathbb {P}\) be an ICTMC and suppose that, for all \(x,y\in \mathcal {X}\) , y is upper reachable from x. Then, there is a unique lower expectation operator \( \underline {\mathbb {E}}_{\mathbb {P}}\bigl [\cdot (X_{+\infty })\bigr ]:\mathcal {L}(\mathcal {X})\to \mathbb {R}\) such that, for all \(f\in \mathcal {L}(\mathcal {X})\) and all \(x\in \mathcal {X}\),

$$\displaystyle \begin{aligned} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_{+\infty})\bigr] = \lim_{t\to+\infty} \underline{\mathbb{E}}_{\mathbb{P}}\bigl[f(X_{t})\,\big\vert\,X_0=x\bigr] = \lim_{t\to+\infty}\bigl[\underline{T}_tf\bigr](x)\,. \end{aligned}$$

5.5 Literature and Further Reading

Let us conclude this chapter by providing pointers to the literature on which the material in this chapter is based. We will also briefly discuss some parts of the literature that are related but not quite the same as what we covered here.

First of all, there exists an extensive body of literature on (precise) Markov chains, both in discrete and in continuous time. It would be nigh impossible to give a complete overview here, but we think that [1, 38] make excellent introductory reads. For a broad and general introduction to the theory of imprecise probability, which lies at the heart of the models that we discussed here, we refer the reader to [3, 50]. The difference between the notions of strong independence and epistemic irrelevance—which we have stressed repeatedly and which is a crucial property of imprecise Markov chains as we treated them here—is discussed, e.g. in [5, 37].

For the interpretation of Markov chains using probability trees, see, for example, [13, 18, 32]. This interpretation is also closely related to the game-theoretic formalisation of probabilities using the theory of martingales (which we did not cover here). The interested reader may want to pursue [13, 32, 49].

For an account of the general theory of Bayesian networks, see [39]. For their imprecise generalisation—credal networks—references [2, 6, 7, 9, 11] discuss a lot of the general theory.

Imprecise discrete-time Markov chains are discussed, e.g. in [15, 17, 27]. For imprecise continuous-time Markov chains, see [30, 44]. A treatment of the matrix exponential, which is crucial to computational methods for CTMCs, is given in [48]. Reference [19] discusses the current state-of-the-art to efficiently compute the imprecise generalisation of the matrix exponential, which we have seen is crucial for computing inferences in ICTMCs.

Detailed treatments on the long-term (limit) behaviour in IDTMCs can be found in [14, 16, 26, 45]. Reference [10] provides the necessary and sufficient conditions for the limit behaviour of ICTMCs, and [19] also discusses computational methods to numerically approximate this limit. We remark that Theorems 5.4 and 5.7 in this chapter are stated in a simplified form compared to their statement in the literature. In particular, the results in [10, 14] are stronger; for instance, [10] in fact provides necessary and sufficient conditions for the convergence of an ICTMC, whereas Theorem 5.7 only states a sufficient condition.

Some examples of the merits of imprecise Markov chains in applications are provided by [40, 46, 47]. A domain for which the applicability of (imprecise) Markov chains has been extensively studied is queueing theory [8, 33–36].

A generalisation of Markov chains that we have not discussed, but which is nevertheless important in many practical applications, is hidden Markov chains. There, the stochastic process cannot be observed directly, but only through a noisy measurement model. Their imprecise treatment is discussed, e.g. in [4, 12, 31].

Fields that are closely related to the theory of imprecise Markov chains are controlled Markov processes [21] and Markov decision processes [22, 28, 41, 51]. There, too, the process under study has its parameters changed over time. However, the goal is not to represent uncertainty about these parameters and to compute robust bounds on quantities of interest; rather, the aim is to optimise the process evolution towards some operational target.

Finally, we again emphasise that our treatment uses epistemic irrelevance, which we distinguish from using strong independence. There is, however, an extended body of literature also on the latter. These alternative models are known as Markov chains under strong independence, e.g. in [27], as interval Markov chains [20, 29, 42, 43] or as Markov set chains [23–25].