
1 Stochastic Matrices

The theory of Markov chains makes use of stochastic matrices. We therefore begin with a small digression of an algebraic nature.

Definition 5.1

An \(r \times r\) matrix \(Q = ({q}_{ij})\) is said to be stochastic if

  1. \({q}_{ij} \geq 0\).

  2. \({\sum \nolimits }_{j=1}^{r}{q}_{ij} = 1\) for any \(1 \leq i \leq r\).

A column vector \(f = ({f}_{1},\ldots, {f}_{r})\) is said to be non-negative if \({f}_{i} \geq 0\) for \(1 \leq i \leq r\). In this case we write \(f \geq 0\).

Lemma 5.2

The following statements are equivalent.

  (a) The matrix Q is stochastic.

  (b1) For any \(f \geq 0\) we have \(Qf \geq 0\), and

  (b2) if \(\mathbf{1} = (1,\ldots, 1)\) is a column vector, then \(Q\mathbf{1} = \mathbf{1}\), that is, the vector \(\mathbf{1}\) is an eigenvector of the matrix Q corresponding to the eigenvalue 1.

  (c) If \(\mu = ({\mu }_{1},\ldots, {\mu }_{r})\) is a probability distribution, that is \({\mu }_{i} \geq 0\) and \({\sum \nolimits }_{i=1}^{r}{\mu }_{i} = 1\), then \(\mu Q\) is also a probability distribution.

Proof

If Q is a stochastic matrix, then (b1) and (b2) hold, and therefore (a) implies (b). We now show that (b) implies (a). Consider the column vector \({\delta }_{j}\) all of whose entries are equal to zero, except the j-th entry which is equal to one. Then \({(Q{\delta }_{j})}_{i} = {q}_{ij} \geq 0\). Furthermore, \({(Q\mathbf{1})}_{i} =\sum\nolimits_{j=1}^{r}{q}_{ij}\), and it follows from the equality \(Q\mathbf{1} = \mathbf{1}\) that \({\sum \nolimits }_{j=1}^{r}{q}_{ij} = 1\) for all i, and therefore (b) implies (a).

We now show that (a) implies (c). If \(\mu ^{\prime} = \mu Q\), then \({\mu }_{j}^{\prime} =\sum\nolimits_{i=1}^{r}{\mu }_{i}{q}_{ij}\). Since Q is stochastic, we have \({\mu }_{j}^{\prime} \geq 0\) and

$$\sum\limits_{j=1}^{r}{\mu }_{ j}^{\prime} =\sum\limits_{j=1}^{r}\sum\limits_{i=1}^{r}{\mu }_{ i}{q}_{ij} =\sum\limits_{i=1}^{r}\sum\limits_{j=1}^{r}{\mu }_{ i}{q}_{ij} =\sum\limits_{i=1}^{r}{\mu }_{ i} = 1.$$

Therefore, \(\mu ^{\prime}\) is also a probability distribution.

Now assume that (c) holds. Consider the row vector \({\delta }_{i}\) all of whose entries are equal to zero, except the i-th entry, which is equal to one. It corresponds to the probability distribution on the set \(\{1,\ldots, r\}\) which is concentrated at the point i. Then \({\delta }_{i}Q\) is also a probability distribution. It follows that \({q}_{ij} \geq 0\) and \({\sum \nolimits }_{j=1}^{r}{q}_{ij} = 1\), that is, (c) implies (a).
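
As a quick numerical illustration (a minimal sketch assuming NumPy; the matrix Q and the vectors below are arbitrary examples, not taken from the text), one can check the conditions of Definition 5.1 and the equivalences of Lemma 5.2 directly:

```python
# A minimal sketch of Definition 5.1 and Lemma 5.2; NumPy is assumed.
import numpy as np

Q = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.1, 0.8],
              [0.4, 0.4, 0.2]])   # a sample stochastic matrix

# (a): non-negative entries, each row summing to one
assert (Q >= 0).all() and np.allclose(Q.sum(axis=1), 1.0)

# (b1) and (b2): Qf >= 0 for f >= 0, and Q fixes the all-ones vector
f = np.array([1.0, 0.0, 2.0])
assert (Q @ f >= 0).all()
assert np.allclose(Q @ np.ones(3), np.ones(3))

# (c): a probability distribution mu remains one after forming mu Q
mu = np.array([0.3, 0.3, 0.4])
mu_new = mu @ Q
assert (mu_new >= 0).all() and np.isclose(mu_new.sum(), 1.0)
```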

Lemma 5.3

Let \(Q^{\prime} = ({q}_{ij}^{\prime})\) and \(Q^{\prime\prime} = ({q}_{ij}^{\prime\prime})\) be stochastic matrices and \(Q = Q^{\prime}Q^{\prime\prime} = ({q}_{ij})\). Then \(Q\) is also a stochastic matrix. If \({q}_{ij}^{\prime\prime} > 0\) for all \(i,j\), then \({q}_{ij} > 0\) for all \(i,j\).

Proof

We have

$${q}_{ij} =\sum\limits_{k=1}^{r}{q}_{ ik}^{\prime}{q}_{kj}^{\prime\prime}.$$

Therefore, \({q}_{ij} \geq 0\). If all \({q}_{kj}^{\prime\prime} > 0\), then \({q}_{ij} > 0\) since \({q}_{ik}^{\prime} \geq 0\) and \({\sum \nolimits }_{k=1}^{r}{q}_{ik}^{\prime} = 1\). Furthermore,

$$\sum\limits_{j=1}^{r}{q}_{ij} =\sum\limits_{j=1}^{r}\sum\limits_{k=1}^{r}{q}_{ik}^{\prime}{q}_{kj}^{\prime\prime} =\sum\limits_{k=1}^{r}{q}_{ik}^{\prime}\sum\limits_{j=1}^{r}{q}_{kj}^{\prime\prime} =\sum\limits_{k=1}^{r}{q}_{ik}^{\prime} = 1.$$

Remark 5.4

We can also consider infinite matrices \(Q = ({q}_{ij})\), \(1 \leq i,j < \infty \). An infinite matrix is said to be stochastic if

  1. \({q}_{ij} \geq 0\), and

  2. \({\sum \nolimits }_{j=1}^{\infty }{q}_{ij} = 1\) for any \(1 \leq i < \infty \).

It is not difficult to show that Lemmas 5.2 and 5.3 remain valid for infinite matrices.

2 Markov Chains

We now return to the concepts of probability theory. Let \(\Omega \) be the space of sequences \(({\omega }_{0},\ldots, {\omega }_{n})\), where \({\omega }_{k} \in X =\{ {x}^{1},\ldots, {x}^{r}\}\), \(0 \leq k \leq n\). Without loss of generality we may identify X with the set of the first r integers, \(X =\{ 1,\ldots, r\}\).

Let \(\mathrm{P}\) be a probability measure on Ω. Sometimes we shall denote by \({\omega }_{k}\) the random variable which assigns the value of the k-th element to the sequence \(\omega = ({\omega }_{0},\ldots, {\omega }_{n})\). It is usually clear from the context whether \({\omega }_{k}\) stands for such a random variable or simply the k-th element of a particular sequence. We shall denote the probability of the sequence \(({\omega }_{0},\ldots, {\omega }_{n})\) by \(\mathrm{p}({\omega }_{0},\ldots, {\omega }_{n})\). Thus,

$$\mathrm{p}({i}_{0},\ldots, {i}_{n}) =\mathrm{ P}({\omega }_{0} = {i}_{0},\ldots, {\omega }_{n} = {i}_{n}).$$

Assume that we are given a probability distribution \(\mu = ({\mu }_{1},\ldots, {\mu }_{r})\) on X and n stochastic matrices \(P(1),\ldots, P(n)\) with \(P(k) = ({p}_{ij}(k))\).

Definition 5.5

The Markov chain with the state space X generated by the initial distribution \(\mu \) on X and the stochastic matrices \(P(1),\ldots, P(n)\) is the probability measure \(\mathrm{P}\) on Ω such that

$$\mathrm{P}({\omega }_{0} = {i}_{0},\ldots, {\omega }_{n} = {i}_{n}) = {\mu }_{{i}_{0}} \cdot {p}_{{i}_{0}{i}_{1}}(1) \cdot \ldots \cdot {p}_{{i}_{n-1}{i}_{n}}(n)$$
(5.1)

for each \({i}_{0},\ldots, {i}_{n} \in X\).

The elements of X are called the states of the Markov chain. Let us check that (5.1) defines a probability measure on \(\Omega \). The inequality \(\mathrm{P}({\omega }_{0} = {i}_{0},\ldots, {\omega }_{n} = {i}_{n}) \geq 0\) is clear. It remains to show that

$$\sum\limits_{{i}_{0}=1}^{r}\ldots \sum\limits_{{i}_{n}=1}^{r}\mathrm{P}({\omega }_{ 0} = {i}_{0},\ldots, {\omega }_{n} = {i}_{n}) = 1.$$

We have

$$\sum\limits_{{i}_{0}=1}^{r}\ldots \sum\limits_{{i}_{n}=1}^{r}\mathrm{P}({\omega }_{ 0} = {i}_{0},\ldots, {\omega }_{n} = {i}_{n})$$
$$=\sum\limits_{{i}_{0}=1}^{r}\ldots \sum\limits_{{i}_{n}=1}^{r}{\mu }_{{i}_{0}} \cdot {p}_{{i}_{0}{i}_{1}}(1) \cdot \ldots \cdot {p}_{{i}_{n-1}{i}_{n}}(n). $$

We now perform the summation over all the values of \({i}_{n}\). Note that \({i}_{n}\) is only present in the last factor in each term of the sum, and the sum \({\sum \nolimits }_{{i}_{n}=1}^{r}{p}_{{i}_{n-1}{i}_{n}}(n)\) is equal to one, since the matrix P(n) is stochastic. We then fix \({i}_{0},\ldots, {i}_{n-2}\) and sum over all the values of \({i}_{n-1}\), and so on. In the end we obtain \({\sum \nolimits }_{{i}_{0}=1}^{r}{\mu }_{{i}_{0}}\), which is equal to one, since \(\mu \) is a probability distribution.
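
This summation argument can be replayed numerically (a small sketch assuming NumPy; the distribution \(\mu \) and the matrices P(1), P(2) are randomly generated examples): enumerating all sequences for n = 2 and summing the probabilities (5.1) gives one.

```python
# A small sketch of the summation argument; NumPy is assumed, and
# mu, P1, P2 are randomly generated examples with r = 3 states.
import itertools
import numpy as np

r = 3
rng = np.random.default_rng(0)

def random_stochastic(r):
    """Return an r x r matrix with non-negative rows summing to one."""
    M = rng.random((r, r))
    return M / M.sum(axis=1, keepdims=True)

mu = rng.random(r)
mu /= mu.sum()                                       # initial distribution
P1, P2 = random_stochastic(r), random_stochastic(r)  # P(1) and P(2)

total = 0.0
for i0, i1, i2 in itertools.product(range(r), repeat=3):
    total += mu[i0] * P1[i0, i1] * P2[i1, i2]        # formula (5.1), n = 2

print(total)                                         # 1.0 up to rounding
```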

In the same way one can prove the following statement:

$$\mathrm{P}({\omega }_{0} = {i}_{0},\ldots, {\omega }_{k} = {i}_{k}) = {\mu }_{{i}_{0}} \cdot {p}_{{i}_{0}{i}_{1}}(1) \cdot \ldots \cdot {p}_{{i}_{k-1}{i}_{k}}(k)$$

for any \(1 \leq {i}_{0},\ldots, {i}_{k} \leq r\), \(k \leq n\). This equality shows that the induced probability distribution on the space of sequences of the form \(({\omega }_{0},\ldots, {\omega }_{k})\) is also a Markov chain generated by the initial distribution \(\mu \) and the stochastic matrices \(P(1),\ldots, P(k)\).

The matrices P(k) are called the transition probability matrices, and the matrix entry \({p}_{ij}(k)\) is called the transition probability from the state i to the state j at time k. The use of these terms is justified by the following calculation.

Assume that \(\mathrm{P}({\omega }_{0} = {i}_{0},\ldots, {\omega }_{k-2} = {i}_{k-2},{\omega }_{k-1} = i) > 0\). We consider the conditional probability \(\mathrm{P}({\omega }_{k} = j\vert {\omega }_{0} = {i}_{0},\ldots, {\omega }_{k-2} = {i}_{k-2},{\omega }_{k-1} = i)\). By the definition of the measure \(\mathrm{P}\),

$$\mathrm{P}({\omega }_{k} = j\vert {\omega }_{0} = {i}_{0},\ldots, {\omega }_{k-2} = {i}_{k-2},{\omega }_{k-1} = i)$$
$$= \frac{\mathrm{P}({\omega }_{0} = {i}_{0},\ldots, {\omega }_{k-2} = {i}_{k-2},{\omega }_{k-1} = i,{\omega }_{k} = j)} {\mathrm{P}({\omega }_{0} = {i}_{0},\ldots, {\omega }_{k-2} = {i}_{k-2},{\omega }_{k-1} = i)}$$
$$= \frac{{\mu }_{{i}_{0}} \cdot {p}_{{i}_{0}{i}_{1}}(1) \cdot \ldots \cdot {p}_{{i}_{k-2}i}(k - 1) \cdot {p}_{ij}(k)} {{\mu }_{{i}_{0}} \cdot {p}_{{i}_{0}{i}_{1}}(1) \cdot \ldots \cdot {p}_{{i}_{k-2}i}(k - 1)} = {p}_{ij}(k).$$

The right-hand side here does not depend on \({i}_{0},\ldots, {i}_{k-2}\). This property is sometimes used as a definition of a Markov chain. It is also easy to see that \(\mathrm{P}({\omega }_{k} = j\vert {\omega }_{k-1} = i) = {p}_{ij}(k)\). (This is proved below for the case of a homogeneous Markov chain.)

Definition 5.6

A Markov chain is said to be homogeneous if \(P(k) = P\) for a matrix P which does not depend on k, \(1 \leq k \leq n\).

The notion of a homogeneous Markov chain can be understood as a generalization of the notion of a sequence of independent identical trials. Indeed, if all the rows of the stochastic matrix \(P = ({p}_{ij})\) are equal to \(({p}_{1},\ldots, {p}_{r})\), where \(({p}_{1},\ldots, {p}_{r})\) is a probability distribution on X, then the Markov chain with such a matrix P and the initial distribution \(({p}_{1},\ldots, {p}_{r})\) is a sequence of independent identical trials.

In what follows we consider only homogeneous Markov chains. Such chains can be represented with the help of graphs. The vertices of the graph are the elements of X. The vertices i and j are connected by an oriented edge if \({p}_{ij} > 0\). A sequence of states \(({i}_{0},{i}_{1},\ldots, {i}_{n})\) which has a positive probability can be represented as a path of length n on the graph starting at the point \({i}_{0}\), then going to the point \({i}_{1}\), and so on. Therefore, a homogeneous Markov chain can be represented as a probability distribution on the space of paths of length n on the graph.

Let us consider the conditional probabilities \(\mathrm{P}({\omega }_{s+l} = j\vert {\omega }_{l} = i)\). It is assumed here that \(\mathrm{P}({\omega }_{l} = i) > 0\). We claim that

$$\mathrm{P}({\omega }_{s+l} = j\vert {\omega }_{l} = i) = {p}_{ij}^{(s)},$$

where \({p}_{ij}^{(s)}\) are elements of the matrix \({P}^{s}\). Indeed,

$$\mathrm{P}({\omega }_{s+l} = j\vert {\omega }_{l} = i) = \frac{\mathrm{P}({\omega }_{s+l} = j,{\omega }_{l} = i)} {\mathrm{P}({\omega }_{l} = i)}$$
$$= \frac{{\sum \nolimits }_{{i}_{0}=1}^{r}\ldots {\sum \nolimits }_{{i}_{l-1}=1}^{r}\sum\nolimits_{{i}_{l+1}=1}^{r}\ldots {\sum \nolimits }_{{i}_{s+l-1}=1}^{r}\mathrm{P}({\omega }_{0} = {i}_{0},\ldots, {\omega }_{l}\! =\! i,\ldots, {\omega }_{s+l}\! =\! j)} {{\sum \nolimits }_{{i}_{0}=1}^{r}\ldots {\sum \nolimits }_{{i}_{l-1}=1}^{r}\mathrm{P}({\omega }_{0} = {i}_{0},\ldots, {\omega }_{l} = i)}$$
$$= \frac{{\sum \nolimits }_{{i}_{0}=1}^{r}\ldots {\sum \nolimits }_{{i}_{l-1}=1}^{r}\sum\limits_{{i}_{l+1}=1}^{r}\ldots {\sum \nolimits }_{{i}_{s+l-1}=1}^{r}{\mu }_{{i}_{0}}{p}_{{i}_{0}{i}_{1}}\ldots {p}_{{i}_{l-1}i}{p}_{i{i}_{l+1}}\ldots {p}_{{i}_{s+l-1}j}} {{\sum \nolimits }_{{i}_{0}=1}^{r}\ldots {\sum \nolimits }_{{i}_{l-1}=1}^{r}{\mu }_{{i}_{0}}{p}_{{i}_{0}{i}_{1}}\ldots {p}_{{i}_{l-1}i}}$$
$$= \frac{{\sum \nolimits }_{{i}_{0}=1}^{r}\ldots {\sum \nolimits }_{{i}_{l-1}=1}^{r}{\mu }_{{i}_{0}}{p}_{{i}_{0}{i}_{1}}\ldots {p}_{{i}_{l-1}i}\sum\nolimits_{{i}_{l+1}=1}^{r}\ldots {\sum \nolimits }_{{i}_{s+l-1}=1}^{r}{p}_{i{i}_{l+1}}\ldots {p}_{{i}_{s+l-1}j}} {{\sum \nolimits }_{{i}_{0}=1}^{r}\ldots {\sum \nolimits }_{{i}_{l-1}=1}^{r}{\mu }_{{i}_{0}}{p}_{{i}_{0}{i}_{1}}\ldots {p}_{{i}_{l-1}i}}$$
$$=\sum\limits_{{i}_{l+1}=1}^{r}\ldots \sum\limits_{{i}_{s+l-1}=1}^{r}{p}_{ i{i}_{l+1}}\ldots {p}_{{i}_{s+l-1}j} = {p}_{ij}^{(s)}. $$

Thus the conditional probabilities \({p}_{ij}^{(s)} =\mathrm{ P}({\omega }_{s+l} = j\vert {\omega }_{l} = i)\) do not depend on l. They are called s-step transition probabilities. A similar calculation shows that for a homogeneous Markov chain with initial distribution \(\mu \),

$$\mathrm{P}({\omega }_{s} = j) = {(\mu {P}^{s})}_{ j} =\sum\limits _{i=1}^{r}{\mu }_{ i}{p}_{ij}^{(s)}.$$
(5.2)

Note that by considering infinite stochastic matrices, Definition 5.5 and the argument leading to (5.2) can be generalized to the case of Markov chains with a countable number of states.
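
In concrete computations, (5.2) amounts to taking matrix powers. The following sketch (NumPy assumed; the matrix P and the initial distribution are arbitrary examples) computes the distribution of \({\omega }_{s}\) for a two-state homogeneous chain:

```python
# A sketch of formula (5.2); NumPy is assumed, P and mu are examples.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # sample transition matrix
mu = np.array([1.0, 0.0])           # chain started in state 1

s = 5
Ps = np.linalg.matrix_power(P, s)   # s-step transition probabilities
print(mu @ Ps)                      # P(omega_s = j) for j = 1, 2
```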

3 Ergodic and Non-ergodic Markov Chains

Definition 5.7

A stochastic matrix P is said to be ergodic if there exists s such that the s-step transition probabilities \({p}_{ij}^{(s)}\) are positive for all i and j. A homogeneous Markov chain is said to be ergodic if it can be generated by some initial distribution and an ergodic stochastic matrix.

By (5.2), ergodicity implies that in s steps one can, with positive probability, proceed from any initial state i to any final state j.

It is easy to provide examples of non-ergodic Markov chains. One could consider a collection of non-intersecting sets \({X}_{1},\ldots, {X}_{n}\), and take \(X ={ \bigcup \nolimits }_{k=1}^{n}{X}_{k}\). Suppose the transition probabilities \({p}_{ij}\) are such that \({p}_{ij} = 0\), unless i and j belong to consecutive sets, that is \(i \in {X}_{k}\), \(j \in {X}_{k+1}\) or \(i \in {X}_{n}\), \(j \in {X}_{1}\). Then the non-zero entries of P form a cyclic block structure, any power of P will contain zeros, and thus P will not be ergodic.
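
For instance (a minimal sketch assuming NumPy, with the simplest case of two one-point sets \({X}_{1}\) and \({X}_{2}\)), every power of the resulting matrix contains zero entries:

```python
# A sketch of the cyclic example; NumPy is assumed, X1 = {1}, X2 = {2}.
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # transitions X1 -> X2 -> X1 -> ...

for s in range(1, 7):
    Ps = np.linalg.matrix_power(P, s)
    print(s, bool((Ps > 0).all()))   # always False: some p_ij^(s) = 0
```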

Another example of a non-ergodic Markov chain arises when a state j cannot be reached from any other state, that is \({p}_{ij} = 0\) for all \(i\neq j\). Then the same will be true for the s-step transition probabilities.

Finally, there may be non-intersecting sets \({X}_{1},\ldots, {X}_{n}\) such that \(X ={ \bigcup \nolimits }_{k=1}^{n}{X}_{k}\), and the transition probabilities \({p}_{ij}\) are such that \({p}_{ij} = 0\), unless i and j belong to the same set \({X}_{k}\). Then the matrix is not ergodic.

The general classification of Markov chains will be discussed in Sect. 5.6.

Definition 5.8

A probability distribution π on X is said to be stationary (or invariant) for a matrix of transition probabilities P if \(\pi P = \pi \).

Formula (5.2) means that if the initial distribution \(\pi \) is a stationary distribution, then the probability distribution of any \({\omega }_{k}\) is given by the same vector π and does not depend on k. Hence the term “stationary”.
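
For example (a standard two-state illustration, not part of the original text), consider

$$P = \left (\begin{array}{cc} 1 - p & p\\ q &1 - q \end{array} \right ),\ \ 0 < p,q < 1.$$

Solving \(\pi P = \pi \) together with \({\pi }_{1} + {\pi }_{2} = 1\) gives \(\pi = (\frac{q} {p+q}, \frac{p} {p+q})\); indeed, \({(\pi P)}_{1} = \frac{q(1-p)+pq} {p+q} = \frac{q} {p+q} = {\pi }_{1}\), and similarly for the second coordinate.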

Theorem 5.9 (Ergodic Theorem for Markov chains).

Given a Markov chain with an ergodic matrix of transition probabilities P, there exists a unique stationary probability distribution \(\pi = ({\pi }_{1},\ldots, {\pi }_{r})\). The n-step transition probabilities converge to the distribution \(\pi \), that is

$$\lim\limits_{n\rightarrow \infty }{p}_{ij}^{(n)} = {\pi }_{ j}.$$

The stationary distribution satisfies \({\pi }_{j} > 0\) for \(1 \leq j \leq r\).

Proof

Let \(\mu ^{\prime} = ({\mu ^{\prime}}_{1},\ldots, {\mu ^{\prime}}_{r}),\mu ^{\prime\prime} = ({\mu ^{\prime\prime}}_{1},\ldots, {\mu ^{\prime\prime}}_{r})\) be two probability distributions on the space X. We set \(d(\mu ^{\prime},\mu ^{\prime\prime}) = \frac{1} {2}\sum\nolimits_{i=1}^{r}\vert {\mu ^{\prime}}_{i} - {\mu ^{\prime\prime}}_{i}\vert \). Then d can be viewed as a distance on the space of probability distributions on X, and the space of distributions with this distance is a complete metric space. We note that

$$0 =\sum\limits_{i=1}^{r}{\mu ^{\prime}}_{ i}-{\sum \nolimits }_{i=1}^{r}{\mu ^{\prime\prime}}_{ i} =\sum\limits_{i=1}^{r}({\mu ^{\prime}}_{ i}-{\mu ^{\prime\prime}}_{i}) =\sum\nolimits^{+}({\mu ^{\prime}}_{ i}-{\mu ^{\prime\prime}}_{i})-{\sum \nolimits }^{+}({\mu ^{\prime\prime}}_{ i}-{\mu ^{\prime}}_{i})\, $$

where \({\sum \nolimits }^{+}\) denotes the summation with respect to those indices i for which the terms are positive. Therefore,

$$d(\mu ^{\prime},\mu ^{\prime\prime}) = \frac{1} {2}\sum\limits_{i=1}^{r}\vert {\mu ^{\prime}}_{ i}-{\mu ^{\prime\prime}}_{i}\vert = \frac{1} {2}{\sum \nolimits }^{+}({\mu ^{\prime}}_{ i}-{\mu ^{\prime\prime}}_{i})+\frac{1} {2}{\sum \nolimits }^{+}({\mu ^{\prime\prime}}_{ i}-{\mu ^{\prime}}_{i}) =\sum\nolimits^{+}({\mu ^{\prime}}_{ i}-{\mu ^{\prime\prime}}_{i}).$$

It is also clear that \(d(\mu ^{\prime},\mu ^{\prime\prime}) \leq 1\).

Let \(\mu ^{\prime}\) and \(\mu ^{\prime\prime}\) be two probability distributions on X and \(Q = ({q}_{ij})\) a stochastic matrix. By Lemma 5.2, \(\mu ^{\prime}Q\) and \(\mu ^{\prime\prime}Q\) are also probability distributions. Let us demonstrate that

$$d(\mu ^{\prime}Q,\mu ^{\prime\prime}Q) \leq d(\mu ^{\prime},\mu ^{\prime\prime}),$$
(5.3)

and if all \({q}_{ij} \geq \alpha \), then

$$d(\mu ^{\prime}Q,\mu ^{\prime\prime}Q) \leq (1 - \alpha )d(\mu ^{\prime},\mu ^{\prime\prime}).$$
(5.4)

Let J be the set of indices j for which \({(\mu ^{\prime}Q)}_{j} - {(\mu ^{\prime\prime}Q)}_{j} > 0\). Then

$$d(\mu ^{\prime}Q,\mu ^{\prime\prime}Q) =\sum\limits_{j\in J}{(\mu ^{\prime}Q - \mu ^{\prime\prime}Q)}_{j} =\sum\limits_{j\in J}\sum\limits_{i=1}^{r}({\mu ^{\prime}}_{ i} - {\mu ^{\prime\prime}}_{i}){q}_{ij}$$
$$\leq {{\sum \nolimits }_{i}}^{+}({\mu ^{\prime}}_{ i} - {\mu ^{\prime\prime}}_{i}){\sum \nolimits }_{j\in J}{q}_{ij} \leq {{\sum \nolimits }_{i}}^{+}({\mu ^{\prime}}_{ i} - {\mu ^{\prime\prime}}_{i}) = d(\mu ^{\prime},\mu ^{\prime\prime}),$$

which proves (5.3). We now note that J cannot contain all the indices j since both \(\mu ^{\prime}Q\) and \(\mu ^{\prime\prime}Q\) are probability distributions. Therefore, at least one index j is missing in the sum \({\sum \nolimits }_{j\in J}{q}_{ij}\). Thus, if all \({q}_{ij} \geq \alpha \), then \({\sum \nolimits }_{j\in J}{q}_{ij} \leq 1 - \alpha \) for all i, and

$$d(\mu ^{\prime}Q,\mu ^{\prime\prime}Q) \leq (1 - \alpha ){{\sum \nolimits }_{i}}^{+}({\mu ^{\prime}}_{ i} - {\mu ^{\prime\prime}}_{i}) = (1 - \alpha )d(\mu ^{\prime},\mu ^{\prime\prime}),$$

which implies (5.4).

Let \({\mu }_{0}\) be an arbitrary probability distribution on X and \({\mu }_{n} = {\mu }_{0}{P}^{n}\). Since P is ergodic, there exists s such that all the entries of the matrix \({P}^{s}\) are positive; let \(\alpha =\min_{i,j}{p}_{ij}^{(s)} > 0\). We shall show that the sequence of probability distributions \({\mu }_{n}\) is a Cauchy sequence, that is for any \(\epsilon > 0\) there exists \({n}_{0}(\epsilon )\) such that for any \(k \geq 0\) we have \(d({\mu }_{n},{\mu }_{n+k}) < \epsilon \) for \(n \geq {n}_{0}(\epsilon )\). Applying (5.4) repeatedly with \(Q = {P}^{s}\),

$$d({\mu }_{n},{\mu }_{n+k}) = d({\mu }_{0}{P}^{n},{\mu }_{ 0}{P}^{n+k}) \leq (1 - \alpha )d({\mu }_{ 0}{P}^{n-s},{\mu }_{ 0}{P}^{n+k-s}) \leq \ldots $$
$$\leq {(1 - \alpha )}^{m}d({\mu }_{ 0}{P}^{n-ms},{\mu }_{ 0}{P}^{n+k-ms}) \leq {(1 - \alpha )}^{m},$$

where m is such that \(0 \leq n - ms < s\). For sufficiently large n we have \({(1 - \alpha )}^{m} < \epsilon \), which implies that \({\mu }_{n}\) is a Cauchy sequence.

Let \(\pi =\lim_{n\rightarrow \infty }{\mu }_{n}\). Then

$$\pi P =\lim\limits_{n\rightarrow \infty }{\mu }_{n}P =\lim\limits_{n\rightarrow \infty }({\mu }_{0}{P}^{n})P =\lim\limits_{n\rightarrow \infty }({\mu }_{0}{P}^{n+1}) = \pi. $$

Let us show that the distribution \(\pi \), such that \(\pi P = \pi \), is unique. Let \({\pi }_{1}\) and \({\pi }_{2}\) be two distributions with \({\pi }_{1} = {\pi }_{1}P\) and \({\pi }_{2} = {\pi }_{2}P\). Then \({\pi }_{1} = {\pi }_{1}{P}^{s}\) and \({\pi }_{2} = {\pi }_{2}{P}^{s}\). Therefore, \(d({\pi }_{1},{\pi }_{2}) = d({\pi }_{1}{P}^{s},{\pi }_{2}{P}^{s}) \leq (1 - \alpha )d({\pi }_{1},{\pi }_{2})\) by (5.4). It follows that \(d({\pi }_{1},{\pi }_{2}) = 0\), that is \({\pi }_{1} = {\pi }_{2}\).

We have proved that for any initial distribution \({\mu }_{0}\) the limit

$$\lim\limits_{n\rightarrow \infty }{\mu }_{0}{P}^{n} = \pi $$

exists and does not depend on the choice of \({\mu }_{0}\). Let us take \({\mu }_{0}\) to be the probability distribution which is concentrated at the point i. Then, for i fixed, \({\mu }_{0}{P}^{n}\) is the probability distribution \(({p}_{ij}^{(n)})\). Therefore, \({\lim }_{n\rightarrow \infty }{p}_{ij}^{(n)} = {\pi }_{j}\).

The proof of the fact that \({\pi }_{j} > 0\) for \(1 \leq j \leq r\) is left as an easy exercise for the reader.

Remark 5.10

Let \({\mu }_{0}\) be concentrated at the point i. Then

$$d({\mu }_{0}{P}^{n},\pi ) = d({\mu }_{0}{P}^{n},\pi {P}^{n}) \leq \ldots \leq {(1 - \alpha )}^{m}d({\mu }_{0}{P}^{n-ms},\pi {P}^{n-ms}) \leq {(1 - \alpha )}^{m},$$

where m is such that \(0 \leq n - ms < s\). Therefore,

$$d({\mu }_{0}{P}^{n},\pi ) \leq {(1 - \alpha )}^{\frac{n} {s} -1} \leq {(1 - \alpha )}^{-1}{\beta }^{n},$$

where \(\beta = {(1 - \alpha )}^{\frac{1} {s} } < 1\). In other words, the rate of convergence of \({p}_{ij}^{(n)}\) to the limit \({\pi }_{j}\) is exponential.
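
This exponential rate is easy to observe numerically (a minimal sketch assuming NumPy; the ergodic matrix P is an arbitrary example, and \(\pi \) is approximated by a high power of P):

```python
# A sketch of Remark 5.10; NumPy is assumed, P is a sample ergodic matrix.
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])     # P^2 already has positive entries

pi = np.linalg.matrix_power(P, 200)[0]  # numerical proxy for pi

dist = np.array([1.0, 0.0, 0.0])        # mu_0 concentrated at state 1
for n in range(1, 16):
    dist = dist @ P
    d = 0.5 * np.abs(dist - pi).sum()   # the distance d from the proof
    print(n, d)                         # decays geometrically in n
```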

Remark 5.11

The term ergodicity comes from statistical mechanics. In our case the ergodicity of a Markov chain implies that a certain loss of memory regarding initial conditions occurs, as the probability distribution at time n becomes nearly independent of the initial distribution as \(n \rightarrow \infty \). We shall discuss further the meaning of this notion in Chap. 16.

4 Law of Large Numbers and the Entropy of a Markov Chain

As in the case of a homogeneous sequence of independent trials, we introduce the random variable \({\nu }_{i}^{n}(\omega )\) equal to the number of occurrences of the state i in the sequence \(\omega = ({\omega }_{0},\ldots, {\omega }_{n})\), that is the number of those \(0 \leq k \leq n\) for which \({\omega }_{k} = i\). We also introduce the random variables \({\nu }_{ij}^{n}(\omega )\) equal to the number of those \(1 \leq k \leq n\) for which \({\omega }_{k-1} = i,{\omega }_{k} = j\).

Theorem 5.12

Let π be the stationary distribution of an ergodic Markov chain. Then for any \(\epsilon > 0\)

$$\lim\limits_{n\rightarrow \infty }\mathrm{P}(\vert \frac{{\nu }_{i}^{n}} {n} - {\pi }_{i}\vert \geq \epsilon ) = 0,\ \ \ \mathrm{for}\ \ 1 \leq i \leq r,$$
$$\lim\limits_{n\rightarrow \infty }\mathrm{P}(\vert \frac{{\nu }_{ij}^{n}} {n} - {\pi }_{i}{p}_{ij}\vert \geq \epsilon ) = 0, \mathrm{for}\ \ 1 \leq i,j \leq r.$$

Proof

Let

$${\chi }_{i}^{k}(\omega ) = \left \{\begin{array}{ll} 1\ \ \ \text{ if }{\omega }_{k} = i, \\ 0\ \ \ \text{ if }{\omega }_{k}\neq i, \end{array} \right. $$
$${\chi }_{ij}^{k}(\omega ) = \left \{\begin{array}{ll} 1\ \ \ \text{ if }{\omega }_{k-1} = i,\ {\omega }_{k} = j, \\ 0\ \ \ \text{ otherwise,} \end{array} \right. $$

so that

$${\nu }_{i}^{n} =\sum\limits_{k=0}^{n}{\chi }_{ i}^{k},\ {\nu }_{ ij}^{n} =\sum\limits_{k=1}^{n}{\chi }_{ ij}^{k}.$$

For an initial distribution \(\mu \),

$$\mathrm{E}{\chi }_{i}^{k} =\sum\limits_{m=1}^{r}{\mu }_{m}{p}_{mi}^{(k)},\ \ \mathrm{E}{\chi }_{ij}^{k} =\sum\limits_{m=1}^{r}{\mu }_{m}{p}_{mi}^{(k-1)}{p}_{ij}.$$

As \(k \rightarrow \infty \) we have \({p}_{mi}^{(k)} \rightarrow {\pi }_{i}\) exponentially fast. Therefore, as \(k \rightarrow \infty \),

$$\mathrm{E}{\chi }_{i}^{k} \rightarrow {\pi }_{ i},\ \mathrm{E}{\chi }_{ij}^{k} \rightarrow {\pi }_{ i}{p}_{ij}$$

exponentially fast. Consequently

$$\mathrm{E}\frac{{\nu }_{i}^{n}} {n} =\mathrm{ E}\frac{{\sum \nolimits }_{k=0}^{n}{\chi }_{i}^{k}} {n} \rightarrow {\pi }_{i},\ \mathrm{E}\frac{{\nu }_{ij}^{n}} {n} =\mathrm{ E}\frac{{\sum \nolimits }_{k=1}^{n}{\chi }_{ij}^{k}} {n} \rightarrow {\pi }_{i}{p}_{ij}.$$

For sufficiently large n

$$\{\omega : \vert \frac{{\nu }_{i}^{n}(\omega )} {n} - {\pi }_{i}\vert \geq \epsilon \} \subseteq \{ \omega : \vert \frac{{\nu }_{i}^{n}(\omega )} {n} - \frac{1} {n}\mathrm{E}{\nu }_{i}^{n}\vert \geq \frac{\epsilon } {2}\},$$
$$\{\omega : \vert \frac{{\nu }_{ij}^{n}(\omega )} {n} - {\pi }_{i}{p}_{ij}\vert \geq \epsilon \} \subseteq \{ \omega : \vert \frac{{\nu }_{ij}^{n}(\omega )} {n} - \frac{1} {n}\mathrm{E}{\nu }_{ij}^{n}\vert \geq \frac{\epsilon } {2}\}.$$

The probabilities of the events on the right-hand side can be estimated using the Chebyshev Inequality:

$$\mathrm{P}(\vert \frac{{\nu }_{i}^{n}} {n} - \frac{1} {n}\mathrm{E}{\nu }_{i}^{n}\vert \geq \frac{\epsilon } {2}) =\mathrm{ P}(\vert {\nu }_{i}^{n} -\mathrm{ E}{\nu }_{ i}^{n}\vert \geq \frac{\epsilon n} {2} ) \leq \frac{4\mathrm{Var}({\nu }_{i}^{n})} {{\epsilon }^{2}{n}^{2}}, $$
$$\mathrm{P}(\vert \frac{{\nu }_{ij}^{n}} {n} - \frac{1} {n}\mathrm{E}{\nu }_{ij}^{n}\vert \geq \frac{\epsilon } {2}) =\mathrm{ P}(\vert {\nu }_{ij}^{n} -\mathrm{ E}{\nu }_{ ij}^{n}\vert \geq \frac{\epsilon n} {2} ) \leq \frac{4\mathrm{Var}({\nu }_{ij}^{n})} {{\epsilon }^{2}{n}^{2}}. $$

Thus the matter is reduced to estimating \(\mathrm{Var}({\nu }_{i}^{n})\) and \(\mathrm{Var}({\nu }_{ij}^{n})\). If we set \({m}_{i}^{k} =\mathrm{ E}{\chi }_{i}^{k} =\sum\nolimits_{s=1}^{r}{\mu }_{s}{p}_{si}^{(k)}\), then

$$\mathrm{Var}({\nu }_{i}^{n}) =\mathrm{ E}{(\sum\limits_{k=0}^{n}({\chi }_{ i}^{k} - {m}_{ i}^{k}))}^{2} =$$
$$\mathrm{E}\sum\limits_{k=0}^{n}{({\chi }_{ i}^{k} - {m}_{i}^{k})}^{2} + 2\sum\limits_{{k}_{1}<{k}_{2}}\mathrm{E}({\chi }_{i}^{{k}_{1} } - {m}_{i}^{{k}_{1} })({\chi }_{i}^{{k}_{2} } -{m}_{i}^{{k}_{2} }).$$

Since \(0 \leq {\chi }_{i}^{k} \leq 1\), we have \(-1 \leq {\chi }_{i}^{k} - {m}_{i}^{k} \leq 1\), \({({\chi }_{i}^{k} - {m}_{i}^{k})}^{2} \leq 1\) and \({\sum \nolimits }_{k=0}^{n}\mathrm{E}{({\chi }_{i}^{k} - {m}_{i}^{k})}^{2} \leq n + 1\). Furthermore,

$$\mathrm{E}({\chi }_{i}^{{k}_{1} } - {m}_{i}^{{k}_{1} })({\chi }_{i}^{{k}_{2} } - {m}_{i}^{{k}_{2} }) =\mathrm{ E}{\chi }_{i}^{{k}_{1} }{\chi }_{i}^{{k}_{2} } - {m}_{i}^{{k}_{1} }{m}_{i}^{{k}_{2} } =$$
$$\sum\limits_{s=1}^{r}{\mu }_{ s}{p}_{si}^{({k}_{1})}{p}_{ ii}^{({k}_{2}-{k}_{1})} - {m}_{ i}^{{k}_{1} }{m}_{i}^{{k}_{2} } = {R}_{{k}_{1},{k}_{2}}.$$

By the Ergodic Theorem (see Remark 5.10),

$${m}_{i}^{k} = {\pi }_{ i} + {d}_{i}^{k},\ \ \vert {d}_{ i}^{k}\vert \leq c{\lambda }^{k},$$
$${p}_{si}^{(k)} = {\pi }_{ i} + {\beta }_{s,i}^{k},\ \ \vert {\beta }_{ s,i}^{k}\vert \leq c{\lambda }^{k},$$

for some constants \(c < \infty \) and \(\lambda < 1\). This gives

$$\vert {R}_{{k}_{1},{k}_{2}}\vert = \vert \sum\limits_{s=1}^{r}{\mu }_{ s}({\pi }_{i} + {\beta }_{s,i}^{{k}_{1} })({\pi }_{i} + {\beta }_{i,i}^{{k}_{2}-{k}_{1} }) - ({\pi }_{i} + {d}_{i}^{{k}_{1} })({\pi }_{i} + {d}_{i}^{{k}_{2} })\vert \leq $$
$${c}_{1}({\lambda }^{{k}_{1} } + {\lambda }^{{k}_{2} } + {\lambda }^{{k}_{2}-{k}_{1} })$$

for some constant \({c}_{1} < \infty \). Therefore, \({\sum \nolimits }_{{k}_{1}<{k}_{2}}{R}_{{k}_{1},{k}_{2}} \leq {c}_{2}n\), and consequently \(\mathrm{Var}({\nu }_{i}^{n}) \leq {c}_{3}n\) for some constants \({c}_{2}\) and \({c}_{3}\). The variance \(\mathrm{Var}({\nu }_{ij}^{n})\) can be estimated in the same way.
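
A Monte Carlo experiment makes the theorem concrete (a minimal sketch assuming NumPy; the two-state matrix P and its stationary distribution \(\pi = (3/4,1/4)\) are a worked example, not from the text):

```python
# A Monte Carlo sketch of Theorem 5.12; NumPy is assumed.
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.75, 0.25])   # solves pi P = pi for this P

n = 100_000
state = 0                     # the limit does not depend on the start
counts = np.zeros(2)
for _ in range(n):
    counts[state] += 1
    state = rng.choice(2, p=P[state])

print(counts / n, pi)         # occupation frequencies approach pi
```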

We now draw a conclusion from this theorem about the entropy of a Markov chain. In the case of a homogeneous sequence of independent trials, for large n the entropy is approximately equal to \(-\frac{1} {n}\ln \mathrm{p}(\omega )\) for typical \(\omega \), that is for \(\omega \) which constitute a set whose probability is arbitrarily close to one. In order to use this property to derive a general definition of entropy, we need to study the behavior of \(\ln \mathrm{p}(\omega )\) for typical \(\omega \) in the case of a Markov chain. For \(\omega = ({\omega }_{0},\ldots, {\omega }_{n})\) we have

$$\mathrm{p}(\omega ) = {\mu }_{{\omega }_{0}}\prod\limits_{i,j}{p}_{ij}^{{\nu }_{ij}^{n}(\omega ) } =\exp (\ln {\mu }_{{\omega }_{0}} +\sum\limits_{i,j}{\nu }_{ij}^{n}(\omega )\ln {p}_{ ij}),$$
$$\ln \mathrm{p}(\omega ) =\ln {\mu }_{{\omega }_{0}} +\sum\limits_{i,j}{\nu }_{ij}^{n}(\omega )\ln {p}_{ ij}.$$

From the Law of Large Numbers, for typical \(\omega \)

$$\frac{{\nu }_{ij}^{n}(\omega )} {n} \sim {\pi }_{i}{p}_{ij}.$$

Therefore, for such \(\omega \)

$$-\frac{1} {n}\ln \mathrm{p}(\omega ) = -\frac{1} {n}\ln {\mu }_{{\omega }_{0}} -\frac{1} {n}\sum\limits_{i,j}{\nu }_{ij}^{n}(\omega )\ln {p}_{ij} \sim -\sum\limits_{i,j}{\pi }_{i}{p}_{ij}\ln {p}_{ij}.$$

Thus it is natural to define the entropy of a Markov chain to be

$$h = -\sum\limits_{i}{\pi }_{i}\sum\limits_{j}{p}_{ij}\ln {p}_{ij}.$$

It is not difficult to show that with such a definition of h, the McMillan Theorem remains true.
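
Given P and \(\pi \), the entropy h is a one-line computation (a minimal sketch assuming NumPy; the matrix is the same worked example as above, with the convention \(0\ln 0 = 0\) for vanishing transition probabilities):

```python
# A sketch computing the entropy h of a Markov chain; NumPy is assumed.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.75, 0.25])                  # stationary distribution of P

with np.errstate(divide="ignore"):
    logP = np.where(P > 0, np.log(P), 0.0)   # convention 0 * ln 0 = 0
h = -np.sum(pi[:, None] * P * logP)
print(h)
```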

5 Products of Positive Matrices

Let \(A = ({a}_{ij})\) be a matrix with positive entries, \(1 \leq i,j \leq r\). Let \({A}^{{_\ast}} = ({a}_{ij}^{{_\ast}})\) be the transposed matrix, that is \({a}_{ij}^{{_\ast}} = {a}_{ji}\). Let us denote the entries of \({A}^{n}\) by \({a}_{ij}^{(n)}\). We shall use the Ergodic Theorem for Markov chains in order to study the asymptotic behavior of \({a}_{ij}^{(n)}\) as \(n \rightarrow \infty \). First, we prove the following:

Theorem 5.13 (Perron-Frobenius Theorem).

There exist a positive number \(\lambda \) (eigenvalue) and vectors \(e = ({e}_{1},\ldots, {e}_{r})\) and \(f = ({f}_{1},\ldots, {f}_{r})\) (right and left eigenvectors) such that

  1. \({e}_{j} > 0,{f}_{j} > 0,\ 1 \leq j \leq r\).

  2. \(Ae = \lambda e\) and \({A}^{{_\ast}}f = \lambda f\).

If \(Ae^{\prime} = \lambda ^{\prime}e^{\prime}\) and \({e^{\prime}}_{j} > 0\) for \(1 \leq j \leq r\), then \(\lambda ^{\prime} = \lambda \) and \(e^{\prime} = {c}_{1}e\) for some positive constant \({c}_{1}\). If \({A}^{{_\ast}}f^{\prime} = \lambda ^{\prime}f^{\prime}\) and \({f^{\prime}}_{j} > 0\) for \(1 \leq j \leq r\), then \(\lambda ^{\prime} = \lambda \) and \(f^{\prime} = {c}_{2}f\) for some positive constant \({c}_{2}\).

Proof

Let us show that there exist \(\lambda > 0\) and a positive vector e such that \(Ae = \lambda e\), that is

$$\sum\limits_{j=1}^{r}{a}_{ij}{e}_{j} = \lambda {e}_{i},\ 1 \leq i \leq r.$$

Consider the convex set \(\mathcal{H}\) of vectors \(h = ({h}_{1},\ldots, {h}_{r})\) such that \({h}_{i} \geq 0,1 \leq i \leq r\), and \({\sum \nolimits }_{i=1}^{r}{h}_{i} = 1\). The matrix A determines a continuous transformation \(\mathcal{A}\) of \(\mathcal{H}\) into itself through the formula

$${(\mathcal{A}h)}_{i} = \frac{{\sum \nolimits }_{j=1}^{r}{a}_{ij}{h}_{j}} {{\sum \nolimits }_{i=1}^{r}\sum\nolimits_{j=1}^{r}{a}_{ij}{h}_{j}}.$$

The Brouwer Theorem states that any continuous mapping of a convex compact set in \({\mathbb{R}}^{n}\) to itself has a fixed point. Thus we can find \(e \in \mathcal{H}\) such that \(\mathcal{A}e = e\), that is,

$${e}_{i} = \frac{{\sum \nolimits }_{j=1}^{r}{a}_{ij}{e}_{j}} {{\sum \nolimits }_{i=1}^{r}\sum\nolimits_{j=1}^{r}{a}_{ij}{e}_{j}}.$$

Note that \({e}_{i} > 0\) for all \(1 \leq i \leq r\). By setting \(\lambda =\sum\nolimits_{i=1}^{r}\sum\nolimits_{j=1}^{r}{a}_{ij}{e}_{j}\), we obtain \({\sum \nolimits }_{j=1}^{r}{a}_{ij}{e}_{j} = \lambda {e}_{i},\ 1 \leq i \leq r\).

In the same way we can show that there is \(\overline{\lambda } > 0\) and a vector f with positive entries such that \({A}^{{_\ast}}f = \overline{\lambda }f\). The equalities

$$\lambda (e,f) = (Ae,f) = (e,{A}^{{_\ast}}f) = (e,\overline{\lambda }f) = \overline{\lambda }(e,f)$$

show that \(\lambda = \overline{\lambda }\).

We leave the uniqueness part as an exercise for the reader.

Let e and f be positive right and left eigenvectors, respectively, which satisfy

$$\sum\limits_{i=1}^{r}{e}_{ i} = 1\ \mathrm{and}\ \sum\limits_{i=1}^{r}{e}_{ i}{f}_{i} = 1.$$

Note that these conditions determine e and f uniquely. Let \(\lambda > 0\) be the corresponding eigenvalue. Set

$${p}_{ij} = \frac{{a}_{ij}{e}_{j}} {\lambda {e}_{i}}. $$

It is easy to see that the matrix \(P = ({p}_{ij})\) is a stochastic matrix with strictly positive entries. The stationary distribution of this matrix is \({\pi }_{i} = {e}_{i}{f}_{i}\). Indeed,

$$\sum\limits_{i=1}^{r}{\pi }_{i}{p}_{ij} =\sum\limits_{i=1}^{r}{e}_{i}{f}_{i}\frac{{a}_{ij}{e}_{j}} {\lambda {e}_{i}} = \frac{1} {\lambda }{e}_{j}\sum\limits_{i=1}^{r}{f}_{i}{a}_{ij} = {e}_{j}{f}_{j} = {\pi }_{j}.$$

We can rewrite \({a}_{ij}^{(n)}\) as follows:

$${a}_{ij}^{(n)} =\sum\limits_{1\leq {i}_{1},\ldots, {i}_{n-1}\leq r}{a}_{i{i}_{1}} \cdot {a}_{{i}_{1}{i}_{2}} \cdot \ldots \cdot {a}_{{i}_{n-2}{i}_{n-1}} \cdot {a}_{{i}_{n-1}j}$$
$$= {\lambda }^{n}\sum\limits_{1\leq {i}_{1},\ldots, {i}_{n-1}\leq r}{p}_{i{i}_{1}}\cdot {p}_{{i}_{1}{i}_{2}}\cdot \ldots \cdot {p}_{{i}_{n-2}{i}_{n-1}}\cdot {p}_{{i}_{n-1}j}\cdot {e}_{i}\cdot {e}_{j}^{-1} = {\lambda }^{n}{e}_{ i}{p}_{ij}^{(n)}{e}_{ j}^{-1}.$$

The Ergodic Theorem for Markov chains gives \({p}_{ij}^{(n)} \rightarrow {\pi }_{j} = {e}_{j}{f}_{j}\) as \(n \rightarrow \infty \). Therefore,

$$\frac{{a}_{ij}^{(n)}} {{\lambda }^{n}} \rightarrow {e}_{i}{\pi }_{j}{e}_{j}^{-1} = {e}_{ i}{f}_{j}$$

and the convergence is exponentially fast. Thus

$${a}_{ij}^{(n)} \sim {\lambda }^{n}{e}_{ i}{f}_{j}\ \ \mathrm{as}\ n \rightarrow \infty. $$

Remark 5.14

One can easily extend these arguments to the case where the matrix \({A}^{s}\) has positive matrix elements for some integer \(s > 0\).
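
Numerically, \(\lambda \), e, and f can be obtained from an eigendecomposition, and the asymptotics \({a}_{ij}^{(n)} \sim {\lambda }^{n}{e}_{i}{f}_{j}\) checked directly (a minimal sketch assuming NumPy; this is a computational shortcut rather than the fixed-point argument of the proof, and A is an arbitrary positive matrix):

```python
# A sketch of the Perron-Frobenius asymptotics; NumPy is assumed.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

vals, V = np.linalg.eig(A)
k = int(np.argmax(vals.real))    # the Perron eigenvalue is the largest
lam = vals[k].real
e = np.abs(V[:, k].real)
e /= e.sum()                     # normalization sum_i e_i = 1

vals2, W = np.linalg.eig(A.T)    # left eigenvectors of A
k2 = int(np.argmax(vals2.real))
f = np.abs(W[:, k2].real)
f /= (e * f).sum()               # normalization sum_i e_i f_i = 1

n = 30
An = np.linalg.matrix_power(A, n)
print(An / lam**n)               # approaches the rank-one matrix (e_i f_j)
print(np.outer(e, f))
```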

6 General Markov Chains and the Doeblin Condition

Markov chains often appear as random perturbations of deterministic dynamics. Let \((X,\mathcal{G})\) be a measurable space and \(f : X \rightarrow X\) a measurable mapping of X into itself. We may wish to consider the trajectory of a point \(x \in X\) under the iterations of f, that is the sequence \(x,f(x),{f}^{2}(x),\ldots \). However, if random noise is present, then x is mapped not to f(x) but to a nearby random point. This means that for each \(C \in \mathcal{G}\) we must consider the transition probability from the point x to the set C. Let us give the corresponding definition.

Definition 5.15

Let \((X,\mathcal{G})\) be a measurable space. A function \(P(x,C)\), \(x \in X,C \in \mathcal{G}\), is called a Markov transition function if for each fixed \(x \in X\) the function P(x,C), as a function of \(C \in \mathcal{G}\), is a probability measure defined on \(\mathcal{G}\), and for each fixed \(C \in \mathcal{G}\) the function \(P(x,C)\) is measurable as a function of \(x \in X\).

For x and C fixed, P(x, C) is called the transition probability from the initial point x to the set C. Given a Markov transition function P(x, C) and an integer \(n \in \mathbb{N}\), we can define the n-step transition function

$${P}^{n}(x,C) ={ \int \nolimits }_{X}\ldots {\int \nolimits }_{X}{ \int \nolimits }_{X}P(x,d{y}_{1})\ldots P({y}_{n-2},d{y}_{n-1})P({y}_{n-1},C).$$

It is easy to see that \({P}^{n}\) satisfies the definition of a Markov transition function.

A Markov transition function P(x, C) defines two operators:

  1. The operator P, which acts on bounded measurable functions:

     $$(Pf)(x) ={ \int \nolimits }_{X}f(y)P(x,dy);$$
     (5.5)

  2. The operator \({P}^{{_\ast}}\), which acts on probability measures:

     $$({P}^{{_\ast}}\mu )(C) ={ \int \nolimits }_{X}P(x,C)d\mu (x).$$
     (5.6)

It is easy to show (see Problem 15) that the image of a bounded measurable function under the action of P is again a bounded measurable function, while the image of a probability measure μ under \({P}^{{_\ast}}\) is again a probability measure.
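
In the finite-state case the operators (5.5) and (5.6) reduce to the matrix operations of Sect. 5.1 (a minimal sketch assuming NumPy; P acts on functions as a matrix applied to a column vector, and \({P}^{{_\ast}}\) acts on measures as multiplication by a row vector):

```python
# A finite-state sketch of the operators (5.5) and (5.6); NumPy assumed.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])     # P(x, {y}) as a stochastic matrix

f = np.array([2.0, -1.0])      # a bounded measurable function on {1, 2}
mu = np.array([0.5, 0.5])      # a probability measure on {1, 2}

Pf = P @ f                     # (Pf)(x) = sum_y f(y) P(x, {y})
Pstar_mu = mu @ P              # (P* mu)({y}) = sum_x P(x, {y}) mu({x})
print(Pf, Pstar_mu)
```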

Remark 5.16

Note that we use the same letter P for the Markov transition function and the corresponding operator. This is partially justified by the fact that the n-th power of the operator corresponds to the n-step transition function, that is

$$({P}^{n}f)(x) ={ \int \nolimits }_{X}f(y){P}^{n}(x,dy).$$

Definition 5.17

A probability measure \(\pi \) is called a stationary (or invariant) measure for the Markov transition function P if \(\pi = {P}^{{_\ast}}\pi \), that is

$$\pi (C) ={ \int \nolimits }_{X}P(x,C)d\pi (x)$$

for all \(C \in \mathcal{G}\).

Given a Markov transition function P and a probability measure \({\mu }_{0}\) on \((X,\mathcal{G})\), we can define the corresponding homogeneous Markov chain, that is the measure on the space of sequences \(\omega = ({\omega }_{0},\ldots, {\omega }_{n})\), \({\omega }_{k} \in X\), \(k = 0,\ldots, n\). Namely, denote by \(\mathcal{F}\) the \(\sigma \)-algebra generated by the elementary cylinders, that is by the sets of the form \(A =\{ \omega : {\omega }_{0} \in {A}_{0},{\omega }_{1} \in {A}_{1},\ldots, {\omega }_{n} \in {A}_{n}\}\) where \({A}_{k} \in \mathcal{G}\), \(k = 0,\ldots, n\). By Theorem 3.19, if we define

$$\mathrm{P}(A) ={ \int \nolimits }_{{A}_{0}\times \ldots \times {A}_{n-1}}d{\mu }_{0}({x}_{0})P({x}_{0},d{x}_{1})\ldots P({x}_{n-2},d{x}_{n-1})P({x}_{n-1},{A}_{n}),$$

there exists a measure on \(\mathcal{F}\) which coincides with \(\mathrm{P}(A)\) on the elementary cylinders. Moreover, such a measure on \(\mathcal{F}\) is unique.

Remark 5.18

We could also consider a measure on the space of infinite sequences \(\omega = ({\omega }_{0},{\omega }_{1},\ldots )\) with \(\mathcal{F}\) still being the \(\sigma \)-algebra generated by the elementary cylinders. In this case, there is still a unique measure on \(\mathcal{F}\) which coincides with \(\mathrm{P}(A)\) on the elementary cylinder sets. Its existence is guaranteed by the Kolmogorov Consistency Theorem which is discussed in Chap. 12.

We have already seen that in the case of Markov chains with a finite state space the stationary measure determines the statistics of typical ω (the Law of Large Numbers). This is also true in the more general setting which we are considering now. Therefore it is important to find sufficient conditions which guarantee the existence and uniqueness of the stationary measure.

Definition 5.19

A Markov transition function P is said to satisfy the strong Doeblin condition if there exist a probability measure \(\nu \) on \((X,\mathcal{G})\) and a function \(p(x,y)\) (the density of \(P(x,dy)\) with respect to the measure \(\nu \)) such that

  1. \(p(x,y)\) is measurable on \((X \times X,\mathcal{G}\times \mathcal{G})\).

  2. \(P(x,C) ={ \int \nolimits }_{C}p(x,y)d\nu (y)\) for all \(x \in X\) and \(C \in \mathcal{G}\).

  3. For some constant \(a > 0\) we have

     $$p(x,y) \geq a\ \ \text{ for all }x,y \in X.$$

Theorem 5.20

If a Markov transition function satisfies the strong Doeblin condition, then there exists a unique stationary measure.

Proof

By the Fubini Theorem, for any measure μ the measure \({P}^{{_\ast}}\mu \) is given by the density \({\int \nolimits }_{X}d\mu (x)p(x,y)\) with respect to the measure \(\nu \). Therefore, if a stationary measure exists, it is absolutely continuous with respect to \(\nu \). Let M be the space of measures which are absolutely continuous with respect to \(\nu \). For \({\mu }^{1},{\mu }^{2} \in M\), the distance between them is defined via \(d({\mu }^{1},{\mu }^{2}) = \frac{1} {2} \int \nolimits \vert {m}^{1}(y) - {m}^{2}(y)\vert d\nu (y)\), where \({m}^{1}\) and \({m}^{2}\) are the densities of \({\mu }^{1}\) and \({\mu }^{2}\) respectively. We claim that \(M\) is a complete metric space with respect to the metric d. Indeed, M is a closed subspace of \({L}^{1}(X,\mathcal{G},\nu )\), which is a complete metric space. Let us show that the operator \({P}^{{_\ast}}\) acting on this space is a contraction.

Consider two measures \({\mu }^{1}\) and \({\mu }^{2}\) with the densities \({m}^{1}\) and \({m}^{2}\). Let \({A}^{+} =\{ y : {m}^{1}(y) - {m}^{2}(y) \geq 0\}\) and \({A}^{-} = X\setminus {A}^{+}\). Similarly let \({B}^{+} =\{ y :{ \int \nolimits }_{X}p(x,y)({m}^{1}(x) - {m}^{2}(x))d\nu (x) \geq 0\}\) and \({B}^{-} = X\setminus {B}^{+}\). Without loss of generality we can assume that \(\nu ({B}^{-}) \geq \frac{1} {2}\) (if the contrary is true and \(\nu ({B}^{+}) > \frac{1} {2}\), we can replace \({A}^{+}\) by \({A}^{-}\), \({B}^{+}\) by \({B}^{-}\) and reverse the signs in some of the integrals below).

As in the discrete case, \(d({\mu }^{1},{\mu }^{2}) ={ \int \nolimits }_{{A}^{+}}({m}^{1}(y) - {m}^{2}(y))d\nu (y)\). Therefore,

$$d({P}^{{_\ast}}{\mu }^{1},{P}^{{_\ast}}{\mu }^{2}) ={ \int \nolimits }_{{B}^{+}}[{\int \nolimits }_{X}p(x,y)({m}^{1}(x) - {m}^{2}(x))d\nu (x)]d\nu (y)$$
$$\leq {\int \nolimits }_{{B}^{+}}[{\int \nolimits }_{{A}^{+}}p(x,y)({m}^{1}(x) - {m}^{2}(x))d\nu (x)]d\nu (y)$$
$$={ \int \nolimits }_{{A}^{+}}[{\int \nolimits }_{{B}^{+}}p(x,y)d\nu (y)]({m}^{1}(x) - {m}^{2}(x))d\nu (x).$$

The last expression contains the integral \({\int \nolimits }_{{B}^{+}}p(x,y)d\nu (y)\) which we estimate as follows

$${\int \nolimits }_{{B}^{+}}p(x,y)d\nu (y) = 1 -{\int \nolimits }_{{B}^{-}}p(x,y)d\nu (y) \leq 1 - a\nu ({B}^{-}) \leq 1 -\frac{a} {2}.$$

This shows that

$$d({P}^{{_\ast}}{\mu }^{1},{P}^{{_\ast}}{\mu }^{2}) \leq (1 -\frac{a} {2})d({\mu }^{1},{\mu }^{2}).$$

Therefore \({P}^{{_\ast}}\) is a contraction and has a unique fixed point, which completes the proof of the theorem.
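
The contraction can be observed on a discretized example (a minimal sketch assuming NumPy; here X = [0,1], \(\nu \) is Lebesgue measure approximated on a grid, and the density p(x,y), bounded below by a positive constant, is an arbitrary choice):

```python
# A sketch of the contraction in Theorem 5.20; NumPy is assumed.
import numpy as np

N = 200
x = (np.arange(N) + 0.5) / N             # grid on X = [0, 1]
dnu = 1.0 / N                            # nu(cell), nu = Lebesgue measure

# a sample transition density with p(x, y) >= a > 0 (strong Doeblin)
p = 0.5 + np.exp(-10 * (x[None, :] - x[:, None]) ** 2)
p /= p.sum(axis=1, keepdims=True) * dnu  # each row integrates to one in y

m = np.ones(N)                           # start from the uniform density
for _ in range(100):
    m = (m * dnu) @ p                    # density of P* mu: int p(x,y) m(x) dnu(x)

print(np.isclose((m * dnu).sum(), 1.0))  # still a probability density
print(np.abs((m * dnu) @ p - m).max())   # approximately a fixed point
```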

The strong Doeblin condition can be considerably relaxed, and yet one may still be able to say something about the stationary measures. We conclude this section with a discussion of the structure of a Markov chain under the Doeblin condition. We shall restrict ourselves to the formulation of the results.

Definition 5.21

We say that P satisfies the Doeblin condition if there is a finite measure \(\mu \) with \(\mu (X) > 0\), an integer n, and a positive \(\epsilon \) such that for any \(x \in X\)

$${P}^{n}(x,A) \leq 1 - \epsilon \ \ \ \ \mathrm{if}\ \ \ \mu (A) \leq \epsilon. $$

Theorem 5.22

If a Markov transition function satisfies the Doeblin condition, then the space X can be represented as the union of non-intersecting sets:

$$X =\bigcup\limits_{i=1}^{k}{E}_{i} \cup T,$$

where the sets \({E}_{i}\) (ergodic components) have the property \(P(x,{E}_{i}) = 1\) for \(x \in {E}_{i}\), and for the set T (the transient set) we have \(\lim_{n\rightarrow \infty }{P}^{n}(x,T) = 0\) for all \(x \in X\). The sets \({E}_{i}\) can in turn be represented as unions of non-intersecting subsets:

$${E}_{i} =\bigcup\limits_{j=0}^{{m}_{i}-1}{C}_{ i}^{j},$$

where \({C}_{i}^{j}\) (cyclically moving subsets) have the property

$$P(x,{C}_{i}^{j+1(\mathrm{mod}\ {m}_{i})}) = 1\ \ \ \mathrm{for}\ \ x \in {C}_{ i}^{j}.$$

Note that if P is a Markov transition function on the state space X, then \(P(x,A)\), \(x \in {E}_{i}\), \(A \subseteq {E}_{i}\) is a Markov transition function on \({E}_{i}\). We have the following theorem describing the stationary measures of Markov transition functions satisfying the Doeblin condition (see “Stochastic Processes” by J.L. Doob).

Theorem 5.23

If a Markov transition function satisfies the Doeblin condition, and \(X = \bigcup\limits_{i=1}^{k}{E}_{i} \cup T\) is a decomposition of the state space into ergodic components and the transient set, then

  1. The restriction of the transition function to each ergodic component has a unique stationary measure \({\pi }_{i}\).

  2. Any stationary measure \(\pi \) on the space \(X\) is equal to a linear combination of the stationary measures on the ergodic components:

     $$\pi =\sum\limits_{i=1}^{k}{\alpha }_{i}{\pi }_{i}$$

     with \({\alpha }_{i} \geq 0\), \({\alpha }_{1} + \ldots + {\alpha }_{k} = 1\).

Finally, we formulate the Strong Law of Large Numbers for Markov chains (see “Stochastic Processes” by J.L. Doob).

Theorem 5.24

Consider a Markov transition function which satisfies the Doeblin condition and has only one ergodic component. Let \(\pi \) be the unique stationary measure. Consider the corresponding Markov chain (measure on the space of sequences \(\omega = ({\omega }_{0},{\omega }_{1},\ldots )\) ) with some initial distribution. Then for any function \(f \in {L}^{1}(X,\mathcal{G},\pi )\) the following limit exists almost surely:

$$\lim\limits_{n\rightarrow \infty }\frac{{\sum}_{k=0}^{n}f({\omega }_{k})} {n + 1} ={\int}_X f(x)d\pi (x).$$

7 Problems

  1. Let P be a stochastic matrix. Prove that there is at least one non-negative vector π such that \(\pi P = \pi \).

  2. Consider a homogeneous Markov chain on a finite state space with the transition matrix P and the initial distribution \(\mu \). Prove that for any \(0 < k < n\) the induced probability distribution on the space of sequences \(({\omega }_{k},{\omega }_{k+1},\ldots, {\omega }_{n})\) is also a homogeneous Markov chain. Find its initial distribution and the matrix of transition probabilities.

  3. Consider a homogeneous Markov chain on a finite state space X with transition matrix \(P\) and the initial distribution \({\delta }_{x}\), \(x \in X\), that is \(\mathrm{P}({\omega }_{0} = x) = 1\). Let \(\tau \) be the first k such that \({\omega }_{k}\neq x\). Find the probability distribution of \(\tau \).

  4. Consider the one-dimensional simple symmetric random walk (Markov chain on the state space \(\mathbb{Z}\) with transition probabilities \({p}_{i,i+1} = {p}_{i,i-1} = 1/2\)). Prove that it does not have a stationary distribution.

  5. For a homogeneous Markov chain on a finite state space X with transition matrix P and initial distribution \(\mu \), find \(\mathrm{P}({\omega }_{n} = {x}^{1}\vert {\omega }_{0} = {x}^{2},{\omega }_{2n} = {x}^{3})\), where \({x}^{1},{x}^{2},{x}^{3} \in X\).

  6. Consider a homogeneous ergodic Markov chain on the finite state space \(X =\{ 1,\ldots, r\}\) with the transition matrix P and the stationary distribution \(\pi \). Assuming that \(\pi \) is also the initial distribution, find the following limit

    $$\lim \limits_{n\rightarrow \infty }\frac{\ln \mathrm{P}({\omega }_{i}\neq 1\ \ \mathrm{for}\ 0 \leq i \leq n)} {n}. $$
  7. Consider a homogeneous ergodic Markov chain on the finite state space \(X =\{ 1,\ldots, r\}\). Define the random variables \({\tau }_{n}\), \(n \geq 1\), as the consecutive times when the Markov chain is in the state 1, that is

    $${\tau }_{1} =\inf (i \geq 0 : {\omega }_{i} = 1),$$
    $${\tau }_{n} =\inf (i > {\tau }_{n-1} : {\omega }_{i} = 1),\ n > 1.$$

    Prove that \({\tau }_{1}\), \({\tau }_{2} - {\tau }_{1}\), \({\tau }_{3} - {\tau }_{2},\ldots \,\) is a sequence of independent random variables.

  8. Consider a homogeneous ergodic Markov chain on a finite state space with the transition matrix P and the stationary distribution \(\pi \). Assuming that \(\pi \) is also the initial distribution, prove that the distribution of the inverse process \(({\omega }_{n},{\omega }_{n-1},\ldots, {\omega }_{1},{\omega }_{0})\) is also a homogeneous Markov chain. Find its matrix of transition probabilities and stationary distribution.

  9. Find the stationary distribution of the Markov chain with the countable state space \(\{0,1,2,\ldots, n,\ldots \}\), where each point, including 0, can either return to 0 with probability 1/2 or move to the right \(n\mapsto n + 1\) with probability 1/2.

  10. Let P be a matrix of transition probabilities of a homogeneous ergodic Markov chain on a finite state space such that \({p}_{ij} = {p}_{ji}\). Find its stationary distribution.

  11. Consider a homogeneous Markov chain on the finite state space \(X =\{ 1,\ldots, r\}\). Assume that all the elements of the transition matrix are positive. Prove that for any \(k \geq 0\) and any \({x}^{0},{x}^{1},\ldots, {x}^{k} \in X\),

    $$\mathrm{P}(\mathrm{there}\ \mathrm{is}\ n\ \mathrm{such}\ \mathrm{that}\ {\omega }_{n} = {x}^{0},{\omega }_{ n+1} = {x}^{1},\ldots, {\omega }_{ n+k} = {x}^{k}) = 1.$$
  12. Consider a Markov chain on a finite state space. Let \({k}_{1},{k}_{2},{l}_{1}\) and \({l}_{2}\) be integers such that \(0 \leq {k}_{1} < {l}_{1} \leq {l}_{2} < {k}_{2}\). Consider the conditional probabilities

    $$f({i}_{{k}_{1}},\ldots, {i}_{{l}_{1}-1},{i}_{{l}_{2}+1},\ldots, {i}_{{k}_{2}}) =$$
    $$\mathrm{P}({\omega }_{{l}_{1}}\! = {i}_{{l}_{1}},\ldots, {\omega }_{{l}_{2}}\! = {i}_{{l}_{2}}\vert {\omega }_{{k}_{1}}\! = {i}_{{k}_{1}},\ldots, {\omega }_{{l}_{1}-1} = {i}_{{l}_{1}-1},{\omega }_{{l}_{2}+1}$$
    $$= {i}_{{l}_{2}+1},\ldots, {\omega }_{{k}_{2}} = {i}_{{k}_{2}})$$

    with \({i}_{{l}_{1}}\),…,\({i}_{{l}_{2}}\) fixed. Prove that whenever f is defined, it depends only on \({i}_{{l}_{1}-1}\) and \({i}_{{l}_{2}+1}\).

  13. Consider a Markov chain whose state space is \(\mathbb{R}\). Let \(P(x,A)\), \(x \in \mathbb{R}\), \(A \in \mathcal{B}(\mathbb{R})\), be the following Markov transition function,

    $$P(x,A) = \lambda ([x - 1/2,x + 1/2] \cap A),$$

    where λ is the Lebesgue measure. Assuming that the initial distribution is concentrated at the origin, find \(\mathrm{P}(\vert {\omega }_{2}\vert \leq 1/4)\).

  14. Let \({p}_{ij}\), \(i,j \in \mathbb{Z}\), be the transition probabilities of a Markov chain on the state space \(\mathbb{Z}\). Suppose that

    $${p}_{i,i-1} = 1 - {p}_{i,i+1} = r(i)$$

    for all \(i \in \mathbb{Z}\), where \(r(i) = {r}_{-} < 1/2\) if \(i < 0\), \(r(0) = 1/2\), and \(r(i) = {r}_{+} > 1/2\) if \(i > 0\). Find the stationary distribution for this Markov chain. Does this Markov chain satisfy the Doeblin condition?

  15. For a given Markov transition function, let P and \({P}^{{_\ast}}\) be the operators defined by (5.5) and (5.6), respectively. Prove that the image of a bounded measurable function under the action of P is again a bounded measurable function, while the image of a probability measure \(\mu \) under \({P}^{{_\ast}}\) is again a probability measure.

  16. Consider a Markov chain whose state space is the unit circle. Let the density of the transition function \(P(x,dy)\) be given by

    $$p(x,y) = \left \{\begin{array}{ll} 1/(2\epsilon )\ \ \ \text{ if }\mathrm{angle}(y,x) < \epsilon, \\ 0\ \ \ \text{ otherwise,}\end{array} \right. $$

    where \(\epsilon > 0\). Find the stationary measure for this Markov chain.