
1 Introduction

In these lecture notes we consider the following topics: the Ergodic theorem (in some textbooks called the Convergence theorem, with ‘Ergodic’ reserved for what we call the Law of Large Numbers; see below), the Law of Large Numbers (LLN), the Central Limit Theorem (CLT), and Large Deviations (LDs) for Markov chains (MC), as well as, as one of the most important applications, the Poisson equation. The LLN, CLT and LDs are the basis of most statistical applications. Everything is presented for the simplest model, a Markov chain with positive transition probabilities on a finite state space, and in some cases we show a few more general results where they do not require much additional effort. This simplified version may be regarded as a preparation for more advanced settings: Markov chains on more general state spaces, including non-compact ones, and Markov diffusions. A special place in this plan is occupied by the coupling method, a famous idea which is not strictly necessary for any result in these lecture notes; yet it is a rather convenient tool ‘for thinking’, although sometimes not so easy to present rigorously. We show the Ergodic theorem first without and then with the coupling method. The Poisson equations in this paper are discrete analogues of ‘real’ Poisson equations for second-order elliptic differential operators in mathematical physics. We consider equations without a potential (the most useful tool in diffusion approximations, cf. [12, 34, 35]) and also with a potential. The problem of smoothness of solutions with respect to a parameter, which makes this material so important in diffusion approximations and is one of the main motivations of the whole theory, is not presented; however, these notes may be regarded as a bridge to this smoothness issue.

These notes are based on several different courses delivered by the author at various universities in various years, including Moscow State University, Helsinki Technical University (now Aalto University), University of Leeds and Magic consortium (http://maths-magic.ac.uk/index.php), and National Research University Higher School of Economics—Moscow. The author thanks all participants—not only students—for their interest and patience and for many useful remarks.

The initial plan involved non-compact cases, with problems related to the stability or recurrence properties of processes in such spaces. However, this would require significantly more time and many more pages. Hence, this more ambitious task is postponed to the future.

Some classical results are given without proofs, although they were proved in the courses delivered. References for all such ‘missing’ proofs are straightforward to find.

Finally, let us mention the numbering convention adopted here: all items such as Theorems, Lemmas, Remarks, Definitions and some others are numbered by a single common sequence of natural numbers. This convention was adopted in some well-known textbooks, and the author shares the view that it is convenient.

The following notations will be used for a process \((X_n, \, n\ge 0)\):

$$ \mathcal{F}^X_n = \sigma (X_k: \, k\le n); \quad \mathcal{F}^X_{(n)} = \sigma (X_n). $$

The following notations from the theory of Markov processes will be used (cf. [11]): the index x in \(\mathbb E_x\) or \(\mathbb P_x\) signifies the expectation or the probability measure corresponding to the non-random initial state \(X_0=x\) of the process. This initial state may also be random, with some distribution \(\mu \), in which case the notations \(\mathbb E_\mu \) or \(\mathbb P_\mu \) may be used.

If the state space S is finite, then |S| denotes the number of its elements. In the sequel, \(\mathcal P\) denotes the transition matrix \(\left( p_{ij}\right) _{1\le i,j \le |S|}\) of the process in the cases where the state space of the process is finite.

Since this is a course about ergodic properties, we do not recall the definitions of Markov, strong Markov, and homogeneous Markov processes (MP), which are assumed to be known to the reader: consult any of the textbooks [4, 10,11,12, 20, 27, 38, 49] if in doubt.

2 Ergodic Theorem – 1

In this section, we state and prove a simple ergodic theorem for Markov chains on a finite state space. We start, however, in a more general setting, because a more general situation will be addressed at the end of these lecture notes. The Ergodic Theorem for Markov chains in the simple situation of a finite state space is due to Markov, although it is sometimes attributed to Kolmogorov with a reference to Gnedenko’s textbook, and sometimes to Doeblin (see [9, 15]). We emphasize that this approach was introduced by Markov himself (see [30, 38, 39]). Kolmogorov did, indeed, contribute to this area: see, in particular, [23].

Let us consider a homogeneous Markov chain \(X = (X_n), \, n=0,1,2,\ldots \) with a general topological state space \((S, \mathcal{S})\), where \(\mathcal{S}\) is the family of all Borel sets in S, assuming that \(\mathcal{S}\) contains all single-point subsets. Let \(P_x(A)\) be the transition kernel, that is, \(P_x(A) = \mathbb P(X_1\in A | X_0=x) \equiv \mathbb P_x(X_1\in A) \); recall that for any \(A \in \mathcal{S}\) this function is assumed to be Borel measurable in x (see [11]) and a measure in A (of course, for a finite S this is not a restriction). Denote by \(P_x(n,A)\) the n-step transition kernel, i.e. \(P_x(n,A) = \mathbb P_x(X_n\in A)\); for a finite Markov chain and \(A=\{j\}\), the notation \(p_{ij}^{(n)}\) will also be used. If the initial state is random with distribution \(\mu \), we will use the similar notation \(P_\mu (n,A)\) for the probability \(\mathbb P_\mu (X_n\in A)\). Recall that \(\mathbb P_{inv}(X_n\in A)\) signifies \(\mathbb P_\mu (X_n\in A)\) with the (unique) invariant measure \(\mu \); naturally, this value does not depend on n.

Recall the definition of ergodicity for Markov chains (MC).

Definition 1

An MC \((X_n)\) is called Markov ergodic iff the sequence of transition measures \((P_x(n, \cdot ))\) has a limit in the total variation metric which is a probability measure, and if, in addition, this limiting measure does not depend on x:

$$\begin{aligned} \lim _{n\rightarrow \infty } \mathbb P_x(n,A) = \mu (A), \quad \forall \; A\in \mathcal{S}. \end{aligned}$$
(1)

Recall that the total variation distance or metric between two probability measures may be defined as

$$ \Vert \mu -\nu \Vert _{TV} := 2 \sup \limits _{A \in \mathcal{S}} (\mu (A) - \nu (A)). $$

Definition 2

An MC \((X_n)\) is called irreducible iff for any \(x\in S\) and \(A\in \mathcal{S}, \, A \not = \emptyset \), there exists n such that

$$ \mathbb P_x(X_n \in A) > 0. $$

An MC \((X_n)\) is called \(\nu \)-irreducible for a given measure \(\nu \) on \((S, \mathcal{S})\) iff for any \(x\in S\) and \(A\in \mathcal{S}, \, \nu (A)>0\) there exists n such that

$$ \mathbb P_x(X_n \in A) > 0. $$

Of course, weaker or stronger ergodicity properties (definitions) may be stated with weaker, or, respectively, stronger metrics. Yet, in the finite state space case all of them are equivalent.

Exercise 3

In the case of a finite state space S with \(\mathcal{S}= 2^S\) (all subsets of S) and the counting measure \(\nu \) given by \(\nu (A) = |A| := \) the number of elements in \(A \subset S\), show that \(\nu \)-irreducibility of an MC is equivalent to the claim that there exists \(n>0\) such that the n-step transition probability matrix \(\mathcal{P}^n\) is positive, that is, all its elements are strictly positive.
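The positivity criterion discussed in this exercise is easy to test numerically. Below is a minimal Python sketch (the function names `mat_mul` and `has_positive_power` and the example matrices are ours, not from the notes): it raises \(\mathcal P\) to successive powers and looks for a strictly positive one.

```python
from itertools import product

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def has_positive_power(P, n_max=None):
    """Return True if some power P^n (1 <= n <= n_max) is strictly
    positive.  By Wielandt's classical bound, for a primitive matrix a
    positive power appears no later than n = (|S| - 1)^2 + 1, so
    stopping there loses nothing."""
    size = len(P)
    if n_max is None:
        n_max = (size - 1) ** 2 + 1
    Q = P
    for _ in range(n_max):
        if all(Q[i][j] > 0 for i, j in product(range(size), repeat=2)):
            return True
        Q = mat_mul(Q, P)
    return False

# A zero one-step entry, yet the 2-step matrix is already positive:
P = [[0.0, 1.0],
     [0.5, 0.5]]
print(has_positive_power(P))        # True

# The periodic two-state flip: its powers alternate between the flip
# matrix and the identity, so no power is ever strictly positive.
flip = [[0.0, 1.0],
        [1.0, 0.0]]
print(has_positive_power(flip))     # False
```

The second example illustrates the role of periodicity discussed later in this section: a cycle prevents any power of the matrix from becoming positive.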

The most standard is the notion of \(\nu \)-irreducibility of an MC where \(\nu \) is the unique invariant measure of the process.

Definition 4

Stationary or invariant probability measure \(\mu \) for a Markov process X is a measure on \(\mathcal S\) such that for each \(A\in \mathcal S\) and any n,

$$ \mu (A) = \sum _{x\in S} \mu (x) P_x(n,A). $$

Lemma 5

A probability measure \(\mu \) is stationary for X iff

$$ \mu \mathcal{P} = \mu , $$

where \(\mathcal P\) is the transition probability matrix of X.

Proof is straightforward by induction.
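Lemma 5 gives a directly checkable fixed-point characterization, \(\mu \mathcal P = \mu \). The following Python sketch (the names and the 2×2 example matrix are ours, for illustration only) approximates a stationary vector by iterating \(\mu \mapsto \mu \mathcal P\) from the uniform distribution and then verifies the fixed-point identity componentwise.

```python
def step(mu, P):
    """One application of the transition matrix: mu -> mu P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary_by_iteration(P, iters=500):
    """Approximate the stationary distribution by iterating mu -> mu P.
    (That this iteration converges under condition (3) is exactly what
    the ergodic theorem of this section guarantees; here we simply
    iterate long enough.)"""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = step(mu, P)
    return mu

P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = stationary_by_iteration(P)

# Lemma 5: pi is stationary iff pi P = pi, componentwise.
residual = max(abs(a - b) for a, b in zip(step(pi, P), pi))
print(residual < 1e-9, abs(pi[0] - 2 / 3) < 1e-9)
```

For this particular matrix the stationary distribution can also be found by hand, \(\pi = (2/3,\, 1/3)\), which the iteration reproduces to machine precision.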

Lemma 6

For any (homogeneous) Markov chain in a finite state space S there exists at least one stationary measure.

Proof of the Lemma 6. The method is due to Krylov and Bogoliubov (Kryloff and Bogoliuboff, [26]). Let us fix some (any) \(i_0\in S\), and consider Cesàro averages

$$ \frac{1}{n+1}\sum _{k=0}^{n}p_{i_0,j}^{(k)}, \quad 1\le j \le N, \quad n\ge 1, $$

where \(N=|S|\). Due to the boundedness, this sequence of vectors as \(n\rightarrow \infty \) has a limit over some subsequence, say, \(n'\rightarrow \infty \),

$$ \frac{1}{n'+1}\sum _{k=0}^{n'}p_{i_0,j}^{(k)} \rightarrow \pi _j, \quad 1\le j \le N, \quad n'\rightarrow \infty , $$

where by the standard convention, \(p^{(0)}_{ij} = \delta _{ij}\) (Kronecker’s symbol). Since S is finite, it follows that \((\pi _j, \, 1\le j\le N)\) is a probability distribution on S. Finally, stationarity follows from a calculation based on Chapman–Kolmogorov’s equations \(p^{(k+1)}_{i_0 j} = \sum _{\ell =1}^{N} p^{(k)}_{i_0 \ell }\, p_{\ell j}\): averaging these over \(k=0,\ldots ,n'\) and passing to the limit along the same subsequence (a single boundary term does not affect the Cesàro limit), we obtain

$$\begin{aligned} \lim _{n' \rightarrow \infty } \frac{1}{n'+1}\sum _{k=0}^{n'}p_{i_0,j}^{(k)} = \sum _{\ell = 1}^{N} \pi _\ell \, p_{\ell , j}^{}. \end{aligned}$$

Hence,

$$\begin{aligned} \pi _j = \sum _{\ell = 1}^{N} \pi _\ell \, p_{\ell , j}^{}, \quad \text{i.e.} \quad \pi = \pi \mathcal P. \end{aligned}$$

Hence, the distribution \((\pi _j)\) is stationary due to the Lemma 5. The Lemma 6 is proved.
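The Cesàro averages from the Krylov–Bogoliubov argument can also be observed numerically. In the Python sketch below (the function names and the example chain are ours), the averaged row is almost stationary: the defect \(\hat{\pi }\mathcal P - \hat{\pi }\) telescopes to \((P^{n+1}-P^{0})/(n+1)\) applied to the row \(i_0\), which is of order 1/n.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cesaro_row(P, i0, n):
    """Cesàro average (1/(n+1)) sum_{k=0}^{n} p^{(k)}_{i0, .},
    with p^{(0)} the identity, as in the proof of Lemma 6."""
    size = len(P)
    Pk = [[float(i == j) for j in range(size)] for i in range(size)]
    acc = [0.0] * size
    for _ in range(n + 1):
        for j in range(size):
            acc[j] += Pk[i0][j]
        Pk = mat_mul(Pk, P)
    return [a / (n + 1) for a in acc]

P = [[0.9, 0.1],
     [0.2, 0.8]]
pi_hat = cesaro_row(P, 0, 2000)

# The average is almost stationary: pi_hat P - pi_hat telescopes to
# (P^{n+1} - P^{0}) / (n+1), so the residual is at most 2/(n+1).
shifted = [sum(pi_hat[i] * P[i][j] for i in range(2)) for j in range(2)]
residual = max(abs(a - b) for a, b in zip(shifted, pi_hat))
print(residual < 2 / 2001, abs(pi_hat[0] - 2 / 3) < 0.01)
```

Unlike the geometric convergence of the matrix powers themselves (Theorem 8 below), the Cesàro averages converge only at rate 1/n; Krylov–Bogoliubov trades speed for generality.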

Remark 7

Note that for a finite S the statement of the Lemma, actually, may be proved much faster by applying the Brouwer fixed-point theorem, as it is done, for example, in [41]. Yet, the method used in the proof seems deeper, and it can be used in a much more general situation including ‘non-compact’ cases. (However, we are not saying that the use of Brouwer’s fixed-point theorem is restricted to finite state spaces.)

From now on, in this and several following sections we consider the case of a finite state space S; a more general case will be addressed in the last sections. The next condition, suggested by Markov himself, plays a very important role in the analysis of the asymptotic behaviour of a (homogeneous) Markov chain (MC in the sequel). Let there exist \(n_0\) such that

$$\begin{aligned} \kappa _{n_0} := \inf _{i,i'} \sum _j \min (P_i(n_0,j), P_{i'}(n_0,j)) \equiv \inf _{i,i'} \sum _j \min (p^{(n_0)}_{ij}, p^{(n_0)}_{i'j}) > 0. \end{aligned}$$
(2)

Following a suggestion of E. Seneta, this coefficient \(\kappa _{n_0}\) (as well as \(\kappa \) in (3) and in (52)) is properly called the Markov–Dobrushin coefficient.

Unlike the continuous-time case, the discrete-time situation presents potential complications related to possible cycles, that is, to a periodic structure of the process. A typical example of such a periodic structure is a situation where the state space is split into two non-intersecting parts, \(S = S_1 \cup S_2\), and \(X_{2n} \in S_1\) while \(X_{2n+1} \in S_2\) for each n. Then it is reasonable to study the ergodic properties separately for \(Y_n:= X_{2n}\) and for \(Z_n:= X_{2n+1}\). In other words, this complication due to periodicity does not introduce anything really new, and for this reason it is traditional to exclude this situation. Hence, in the sequel we will study our ergodic process under the assumption \(n_0=1\) in the condition (2). Similar results could be obtained under the more general assumption of aperiodicity.

So, here is the simplified version of (2), which will be accepted in the sequel:

$$\begin{aligned} \kappa := \inf _{i,i'} \sum _j \min (P_i(1,j), P_{i'}(1,j)) \equiv \inf _{i,i'} \sum _j \min (p^{}_{ij}, p^{}_{i'j}) > 0. \end{aligned}$$
(3)

Also, to clarify the ideas we will be using in some cases the following stronger assumption,

$$\begin{aligned} \kappa _0:=\inf _{ij} p_{ij} > 0. \end{aligned}$$
(4)

However, eventually, the assumption (4) will be dropped and only (3) will remain in use.
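Both coefficients are elementary to compute from a given transition matrix. A small Python sketch (the function names and the example matrix are ours): the matrix with zero diagonal shows that (3) may hold, here with \(\kappa = 1/2\), while (4) fails.

```python
def kappa(P):
    """Markov-Dobrushin coefficient from (3): the infimum over pairs
    of rows of the overlap sum_j min(p_ij, p_i'j)."""
    n = len(P)
    return min(sum(min(P[i][j], P[k][j]) for j in range(n))
               for i in range(n) for k in range(n))

def kappa0(P):
    """The stronger coefficient from (4): the smallest one-step entry."""
    return min(min(row) for row in P)

# A chain with zero diagonal: (4) fails, yet (3) holds with kappa = 1/2.
P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
print(kappa(P), kappa0(P))   # 0.5 0.0
```

This also makes concrete why (3) is genuinely weaker than (4): condition (3) only asks that every two rows overlap somewhere, not that every entry be positive.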

Theorem 8

Let the assumption (3) hold true. Then the process \((X_n)\) is ergodic, i.e. there exists a limiting probability measure \(\mu \) such that (1) holds true. Moreover, the uniform bound is satisfied for every n,

$$\begin{aligned} \sup _{x}\sup _{A\in \mathcal{S}} |P_x(n,A) - \mu (A)| \le (1-\kappa )^{n}, \end{aligned}$$
(5)

and the measure \(\mu \) is a unique invariant one.

The proof of Theorem 8 is classical and may be found in many places, for example, in [15]; we present it for completeness.

(A) Denote for any A,

$$ m^{(n)}(A):= \min _i P_i(n,A), \quad M^{(n)}(A):= \max _i P_i(n,A). $$

By Chapman–Kolmogorov’s equation,

$$\begin{aligned}&\displaystyle m^{(n+1)}(A)= \min _i P_i(n+1,A) = \min _i \sum _j p_{ij} P_j(n,A) \\ \\&\displaystyle \ge \min _i \sum _j p_{ij} \min _{j'} P_{j'}(n,A) =m^{(n)}(A), \end{aligned}$$

which signifies that the sequence \(m^{(n)}(A)\) does not decrease in n. Similarly, the sequence \(M^{(n)}(A)\) does not increase in n. Hence, it suffices to show that

$$\begin{aligned} M^{(n)}(A) - m^{(n)}(A) \le (1-\kappa )^n. \end{aligned}$$
(6)

(B) Again by Chapman–Kolmogorov’s equation,

$$\begin{aligned} M^{(n)}(A) - m^{(n)}(A)= \max _i P_i(n,A) - \min _{i'} P_{i'}(n,A) \\ \\ = \max _i \sum _j p_{ij} P_{j}(n-1,A) - \min _{i'} \sum _j p_{i' j} P_{j}(n-1,A). \end{aligned}$$

Let maximum here be attained at \(i_+\) while minimum at \(i_-\). Then,

$$\begin{aligned}&\displaystyle M^{(n)}(A) - m^{(n)}(A)= \sum _j p_{i_+ j} P_{j}(n-1,A) - \sum _j p_{i_- j} P_{j}(n-1,A) \nonumber \\&\displaystyle = \sum _j (p_{i_+ j} - p_{i_- j}) P_{j}(n-1,A). \end{aligned}$$
(7)

(C) Denote by \(S^+\) the part of the sum in the right hand side of (7) with just \((p_{i_+ j} - p_{i_- j})\ge 0\), and by \(S^-\) the part of the sum with \((p_{i_+ j} - p_{i_- j}) < 0\). Using notations \(a_+ = a \vee 0\) and \(a_- = a\wedge 0\) (where \(a\vee b = \max (a,b)\) and \(a\wedge b = \min (a,b)\)), we estimate,

$$ S^+ \le \sum _j (p_{i_+ j} - p_{i_- j})_+ M^{(n-1)}(A) = M^{(n-1)}(A)\sum _j (p_{i_+ j} - p_{i_- j})_+, $$

and

$$ S^- \le \sum _j (p_{i_+ j} - p_{i_- j})_- m^{(n-1)}(A). $$

Therefore,

$$\begin{aligned}&\displaystyle M^{(n)}(A) - m^{(n)}(A)= S^+ + S^- \\ \\&\displaystyle \le M^{(n-1)}(A) \sum _j (p_{i_+ j} - p_{i_- j})_+ + m^{(n-1)}(A) \sum _j (p_{i_+ j} - p_{i_- j})_-. \end{aligned}$$

(D) It remains to notice that

$$ \sum _j (p_{i_+ j} - p_{i_- j})_- = - \sum _j (p_{i_+ j} - p_{i_- j})_+, $$

and

$$\begin{aligned} \sum _j (p_{i_+ j} - p_{i_- j})_+ \le 1-\kappa . \end{aligned}$$
(8)

The first follows from the normalization condition

$$ \sum _j p_{i_+ j} = \sum _j p_{i_- j} = 1, $$

while the second from (recall that \((a-b)_+ = a - a\wedge b \equiv a - \min (a,b)\) for any real values a, b)

$$\begin{aligned}&\displaystyle \sum _j (p_{i_+ j} - p_{i_- j})_+ = \sum _j (p_{i_+ j} - \min (p_{i_- j},p_{i_+ j})) \\\\&\displaystyle = 1 - \sum _j \min (p_{i_- j},p_{i_+ j}) \le 1-\kappa \end{aligned}$$

(see the definition of \(\kappa \) in (3)). So, we find that

$$ M^{(n)}(A) - m^{(n)}(A) \le (1-\kappa )\,(M^{(n-1)}(A) - m^{(n-1)}(A)). $$

By induction this implies (6). So, (5) and uniqueness of the limits \(\pi _j=\lim _{n\rightarrow \infty } p^{(n)}_{ij}\) follow.

(E) The invariance of the measure \(\mu \) and the uniqueness of the invariant measure follow, in turn, from (5). Indeed, let us start the process from any invariant distribution \(\mu \) (which exists due to the Lemma 6); then \(\mu _j\equiv \mathbb P_\mu (X_n=j)= \sum _{\ell }\mu _\ell p^{(n)}_{\ell j} \rightarrow \pi _j, \; n\rightarrow \infty \). However, the left hand side here does not depend on n. Hence, \(\mu _j=\pi _j\). The Theorem 8 is proved.

In terms of the total variation metric defined above, the inequality (5) may be rewritten as

$$\begin{aligned} \sup _{x} \Vert P_x(n,\cdot ) - \mu (\cdot )\Vert _{TV} \le 2 (1-\kappa )^{n}. \end{aligned}$$
(9)
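The bound (9) can be checked numerically for a concrete chain. In the Python sketch below (illustrative only; the 2×2 matrix and all names are ours), \(\kappa \) is computed from (3), the invariant measure is taken from a high matrix power, and the total variation distance between the discrete distributions is evaluated as their \(L_1\) distance, which equals \(2\sup _A(\mu (A)-\nu (A))\).

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def kappa(P):
    """Markov-Dobrushin coefficient from (3)."""
    n = len(P)
    return min(sum(min(P[i][j], P[k][j]) for j in range(n))
               for i in range(n) for k in range(n))

P = [[0.9, 0.1],
     [0.2, 0.8]]
k = kappa(P)                      # approximately 0.3 for this chain

# The stationary measure to machine precision: any row of a high power.
Q = P
for _ in range(200):
    Q = mat_mul(Q, P)
pi = Q[0]

# Check sup_x ||P_x(n, .) - mu||_TV <= 2 (1 - kappa)^n for n = 1..20,
# writing the total variation distance as the L1 distance.
Q, ok = P, True
for n in range(1, 21):
    worst = max(sum(abs(Q[i][j] - pi[j]) for j in range(2))
                for i in range(2))
    ok = ok and worst <= 2 * (1 - k) ** n + 1e-12
    Q = mat_mul(Q, P)
print(ok)   # True
```

For this particular chain the true convergence rate happens to coincide with \(1-\kappa = 0.7\), so the bound is essentially sharp here up to the constant factor.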

Corollary 9

Under the assumption of the Theorem 8, for any bounded Borel function f and for any \(0\le s < t\),

$$ \sup _x |{\mathbb E}_x (f(X_t) | X_s) - {\mathbb E}_{inv} f(X_t)| \equiv \sup _x |{\mathbb E}_x (f(X_t) - {\mathbb E}_{inv} f(X_t)| X_s) | \le C_f (1-\kappa )^{t-s}, $$

or, equivalently,

$$ \sup _x |{\mathbb E}_x (f(X_t) | \mathcal{F}^X_s) - {\mathbb E}_{inv} f(X_t)| \le C_f (1-\kappa )^{t-s}, $$

where \(C_f = \max \limits _j |f(j)| \equiv \Vert f\Vert _{B(S)}\).

This useful Corollary follows from the Theorem 8.

It is worth noting that in the general case there is a significantly weaker condition than (2) (or, in the general case, weaker than (52); see Sect. 11 below), which also guarantees an exponential convergence rate to a unique invariant measure. We will present this condition, called the Doeblin–Doob condition, and state the corresponding famous Doeblin–Doob convergence theorem; for the proof we refer the reader to [10].

Definition 10

(DD-condition) There exist a finite (sigma-additive) measure \(\nu \ge 0\) and \(\epsilon >0\), \(s>0\) such that \(\nu (A)\le \epsilon \) implies

$$ \sup _x P_x(s, A) \le 1 - \epsilon . $$

Theorem 11

(Doeblin–Doob, without proof) If the DD-condition is satisfied for an aperiodic MP with a unique class of ergodicity (see [10]) on the state space \(\mathcal S\), then there exist \(C,c>0\) such that

$$\begin{aligned} \sup _{x}\sup _{A\in \mathcal{S}} |P_x(n,A) - \mu (A)| \le C\exp (-cn), \quad n\ge 0. \end{aligned}$$
(10)

It turns out that under the assumption (DD), the constants in the upper bound (10) cannot be effectively computed; that is, they may be arbitrary even for the same \(\epsilon \) and \(\nu \), say. This situation differs dramatically from the case of conditions (4) and (3), where both constants in the upper bound are evaluated effectively and explicitly.

Open Question 12

It is interesting whether or not there may exist an intermediate situation with a bound like (10) (in particular, uniform in the initial state) with computable constants C, c, under an assumption lying somewhere ‘between’ Markov–Dobrushin’s and Doeblin–Doob’s. Apparently, such a condition may be artificially constructed from a ‘non-compact’ theory with an exponential recurrence, but then the bounds would not be uniform in the initial data. In fact, a relatively simple version of the desired condition will be shown at the end of this text, see the Theorem 47. However, it does not completely close the problem, e.g. for non-compact spaces.

3 LLN for Homogeneous MC, Finite S

It may seem as if the Ergodic Theorem with a uniform exponential convergence rate in the total variation metric were all we could wish for concerning the ergodic features of the process. Yet the statement of this theorem does not even include the Law of Large Numbers (LLN), a point not emphasized in most textbooks. However, the situation with the LLN (as well as with the Central Limit Theorem, CLT) is good enough, as demonstrated below. The Theorem 13 under the assumption (4) belongs to A.A. Markov, see [30, 38].

Theorem 13

(Weak LLN) Under the assumptions of the Theorem 8, for any function f on a finite state space S,

$$\begin{aligned} \frac{1}{n} \sum _{k=0}^{n-1} f(X_k) {\mathop {\rightarrow }\limits ^{\mathbb P}} {\mathbb E}_{inv}f(X_0), \end{aligned}$$
(11)

where \({\mathbb E}_{inv}\) stands for the expectation of \(f(X_0)\) with respect to the invariant probability measure of the process, while \(\mathbb P\) denotes the measure corresponding to the initial value or distribution of \(X_0\); the latter may or may not be stationary.

NB. Note that the simultaneous use of stationary and non-stationary measures here is not a contradiction. The initial state may be non-random, or it may have some distribution. At the same time, the process has a unique invariant measure \(\mu \), and the writing \(\mathbb E_{inv}f(X_0)=0\) signifies merely that \(\sum \limits _{y\in S} f(y) \mu (y) = 0\), which is in no way in conflict with a non-stationary initial distribution. In the next proof we use \(\mathbb P\) and, respectively, \(\mathbb E\) without specifying the initial state or distribution. However, this initial distribution (possibly concentrated at one single state) exists and is fixed throughout the proof.

Proof of the Theorem 13. 1. First of all, note that (11) is equivalent to

$$ \frac{1}{n} \sum _{k=0}^{n-1} (f(X_k) - {\mathbb E}_{inv}f(X_0)) {\mathop {\rightarrow }\limits ^{\mathbb P}} 0, $$

so, without loss of generality we may and will assume that \( {\mathbb E}_{inv}f(X_0) = 0\). Now we estimate with any \(\epsilon > 0\) by the Bienaymé–Chebyshev–Markov inequality,

$$\begin{aligned} {\mathbb P} \left( |\frac{1}{n} \sum _{k=0}^{n-1} f(X_k)| > \epsilon \right) \le \epsilon ^{-2}n^{-2}{\mathbb E} |\sum _{k=0}^{n-1} f(X_k)|^2 \nonumber \\ \\ \nonumber = \epsilon ^{-2}n^{-2}{\mathbb E} \sum _{k=0}^{n-1} f^2(X_k) + 2\epsilon ^{-2}n^{-2}{\mathbb E} \sum _{0\le k<j\le n-1}^{} f(X_k)f(X_j). \end{aligned}$$
(12)

Here the first term, clearly (as f is bounded), satisfies,

$$ \epsilon ^{-2}n^{-2}{\mathbb E} \sum _{k=0}^{n-1} f^2(X_k) \rightarrow 0, \quad n\rightarrow \infty . $$

Let us transform the second term as follows for \(k<j\):

$$\begin{aligned} {\mathbb E} f(X_k)f(X_j) = {\mathbb E} (f(X_k) {\mathbb E} (f(X_j) | X_k)), \end{aligned}$$

and recall that due to the Corollary 9 to the Ergodic theorem,

$$ |{\mathbb E} (f(X_j) | X_k) - {\mathbb E}_{inv} f(X_j)| \le C_f (1-\kappa )^{j-k}, $$

where due to our convention \({\mathbb E}_{inv} f(X_j)=0\). Therefore, we have,

$$\begin{aligned}&\displaystyle |{\mathbb E} \sum _{k<j}^{} f(X_k)f(X_j)| = |{\mathbb E} \sum _{k<j}^{} f(X_k) {\mathbb E} (f(X_j) | X_k)| \\\\&\displaystyle \le C_f \sum _{k,j:\, 0\le k<j<n}^{}(1-\kappa )^{j-k} \le C n, \quad \text{ with } \;\; C = C_f \kappa ^{-1}. \end{aligned}$$

Thus, the second term in (12) also goes to zero as \(n\rightarrow \infty \). The Theorem 13 is proved.

Remark 14

Recall that f is bounded and exponential rate of convergence is guaranteed by the assumptions. This suffices for a strong LLN via higher moments for sums. However, it will not be used in the sequel, so we do not show it here.
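The statement (11) is easy to observe by simulation. The following Python sketch (all names and the example chain are ours, not from the notes) runs a two-state chain from a deliberately non-stationary initial state and compares the time average of f with \({\mathbb E}_{inv}f(X_0)\); the tolerance 0.05 is roughly ten standard deviations of the time average for this chain, so the check is expected to pass comfortably.

```python
import random

def simulate(P, x0, n, rng):
    """Sample a trajectory X_0, ..., X_{n-1} of the chain with
    transition matrix P started at x0."""
    path = [x0]
    while len(path) < n:
        u, x = rng.random(), path[-1]
        acc = 0.0
        for j, p in enumerate(P[x]):
            acc += p
            if u < acc:
                path.append(j)
                break
        else:                       # guard against rounding of the row sum
            path.append(len(P) - 1)
    return path

P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2 / 3, 1 / 3]                 # stationary distribution of this P
f = [1.0, -1.0]                     # an arbitrary function on S = {0, 1}
mean_inv = sum(f[j] * pi[j] for j in range(2))   # E_inv f(X_0) = 1/3

rng = random.Random(1)
n = 200_000
path = simulate(P, 0, n, rng)       # deliberately non-stationary start
avg = sum(f[x] for x in path) / n
print(abs(avg - mean_inv) < 0.05)
```

Note that the empirical average converges to the invariant-measure expectation even though the chain was started from a fixed state, exactly as Theorem 13 asserts.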

4 CLT, Finite S

In this section, state space S is also finite. For the function f on S, let

$$\begin{aligned} \sigma ^2&:= {\mathbb E}_{inv}(f(X_0) -\mathbb E_{inv}f(X_0))^2 + 2 \sum _{k=1}^{\infty } {\mathbb E}_{inv}(f(X_0) \nonumber \\&\quad -\mathbb E_{inv}f(X_0))(f(X_k) - \mathbb E_{inv}f(X_k)). \end{aligned}$$
(13)

It is known that this definition provides a non-negative value (for completeness, see the two lemmata below).

Lemma 15

Under our standing assumptions (S is finite and \(\min \limits _{ij}p_{ij}>0\)),

$$\begin{aligned} \sigma ^2\ge 0, \end{aligned}$$
(14)

and, moreover,

$$\begin{aligned} n^{-1}{\mathbb E}_{inv} \left( \sum ^{n-1}_{r=0} \left( f(X_r) - \mathbb E_{inv}f(X_0)\right) \right) ^2 \rightarrow \sigma ^2, \quad n\rightarrow \infty , \end{aligned}$$
(15)

where the latter convergence is irrespectively of whether \(\sigma ^2>0\), or \(\sigma ^2=0\).

Proof

Without loss of generality, we may and will assume now that \({\mathbb E}_{inv}f(X_0) =0\) (otherwise, this mean value can be subtracted from f, as in the formula (15)). Note also that in this case the variance of the random variable \(\displaystyle n^{-1/2}\sum \limits ^{n-1}_{r=0} f(X_r)\), computed with respect to the invariant measure, coincides with its second moment. Since \({\mathbb E}_{inv}f(X_i) = 0\) for any i, this second moment may be evaluated as follows,

$$\begin{aligned}&\displaystyle n^{-1} {\mathbb E}_{inv}\left(\sum ^{n-1}_{r=0} f(X_r)\right)^2 = {\mathbb E}_{inv}f^2(X_0) + 2n^{-1} \sum _{0\le i < j\le n-1} {\mathbb E}_{inv}f(X_i)f(X_j) \\\\&\displaystyle = {\mathbb E}_{inv}f^2(X_0) + 2n^{-1} \sum _{r=1}^{n-1} (n-r)\, {\mathbb E}_{inv}f(X_0)f(X_r) \\\\&\displaystyle {\mathop {\rightarrow }\limits ^{\mathrm{clearly}}} {\mathbb E}_{inv}f^2(X_0) + 2 \sum _{r=1}^{\infty } {\mathbb E}_{inv}f(X_0)f(X_r)= \sigma ^2, \quad n\rightarrow \infty . \end{aligned}$$

Here the left hand side is non-negative, so \(\sigma ^2\) is non-negative, too. The Lemma 15 is proved.

Lemma 16

Under the same assumptions as in the previous Lemma, \(\sigma ^2 < \infty .\)

Proof

Again, without loss of generality, we may and will assume \(\bar{f}:= {\mathbb E}_{inv}f(X_0) =0\), and \(\Vert f\Vert _{B}\le 1\). We have, due to the Corollary 9 applied with \(\bar{f} = 0\),

$$\begin{aligned} |{\mathbb E}_{inv} f(X_0)f(X_k) | = |{\mathbb E}_{inv} f(X_0){\mathbb E}_{inv}(f(X_k) | X_0)| \le C_f{\mathbb E}_{inv} |f(X_0)| q^{k}, \end{aligned}$$

with some \(0\le q < 1\) and \(C_f = \Vert f\Vert _{B} \le 1\). So, the series in (13) does converge and the Lemma 16 is proved.
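For a concrete chain, the series (13) can be summed numerically, with each term computed exactly from powers of \(\mathcal P\) rather than by simulation. In the Python sketch below (the names and the two-state example are ours), the truncated series matches the closed-form value \(136/27\), which is available for this particular example because in a two-state chain every centred function is proportional to the second eigenvector, so \({\mathbb E}_{inv}g(X_0)g(X_k) = \lambda ^k \, {\mathbb E}_{inv}g^2(X_0)\) with \(\lambda = 0.7\).

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2 / 3, 1 / 3]                   # stationary distribution of P
f = [1.0, -1.0]
fbar = sum(pi[i] * f[i] for i in range(2))
g = [f[i] - fbar for i in range(2)]   # centred version of f

def cov(Q):
    """E_inv g(X_0) g(X_k) when Q = P^k (exact, no simulation)."""
    return sum(pi[i] * g[i] * Q[i][j] * g[j]
               for i in range(2) for j in range(2))

# Partial sums of the series (13); the terms decay geometrically
# (Lemma 16), so truncating at k = 200 is far below rounding error.
Q = [[1.0, 0.0], [0.0, 1.0]]
sigma2 = cov(Q)
for k in range(1, 201):
    Q = mat_mul(Q, P)
    sigma2 += 2 * cov(Q)

# Closed form for this two-state chain: Var_inv(f) * (1 + 0.7) / (1 - 0.7).
print(abs(sigma2 - 136 / 27) < 1e-9)
```

The geometric decay established in Lemma 16 is what makes such a truncation honest: the neglected tail is bounded by a constant times \(q^{201}/(1-q)\).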

Theorem 17

Let the assumption (3) hold true. Then for any function f on S,

$$\begin{aligned} \frac{1}{\sqrt{n}} \sum _{k=0}^{n-1} (f(X_k) - {\mathbb E}_{inv}f(X_0)) {\mathop {\Longrightarrow }\limits ^{\mathbb P}} \eta \sim \mathcal{N}(0, \sigma ^2), \end{aligned}$$
(16)

where \(\Longrightarrow \) stands for the weak convergence with respect to the original probability measure (i.e. generally speaking, non-invariant).

We emphasize that we subtract the expectation with respect to the invariant measure, while the weak convergence holds true with respect to the initial measure, which is not necessarily invariant. (We could have subtracted the actual expectation instead; the difference would be negligible due to the Corollary 9.)

Remark 18

Concerning Markov’s method in the CLT the reader may consult the textbook [41]. Various approaches can be found in [1, 2, 10, 23, 31, 32, 38], et al. For a historical review see [39]. The nontrivial issue of distinguishing the cases \(\sigma ^2>0\) and \(\sigma ^2=0\) for stationary Markov chains is discussed in [3], where a criterion for \(\sigma ^2=0\) has been established for finite MCs; this criterion was extended to more general cases in [24]. A simple example of an irreducible aperiodic MC (with \(\min \limits _{ij}p_{ij}=0\)) and a non-constant function f with \(\sigma ^2=0\) can be found in [41, ch. 6]. Nevertheless, there is a general belief that ‘normally’, in ‘most cases’, \(\sigma ^2>0\). (Recall that zero, i.e. a constant, is regarded as a degenerate Gaussian random variable \(\mathcal{N}(0,0)\).) On using weaker norms in the CLT for Markov chains see [28].

Proof of the Theorem 17. Without loss of generality, assume that \(\Vert f\Vert _B\le 1\), and that

$$ {\mathbb E}_{inv}f(X_0) = 0. $$

I. Firstly, consider the case \(\sigma ^2 > 0\). We want to check the assertion,

$$ \mathbb E\exp \left( i\frac{\lambda }{\sqrt{n}}\sum _{r=0}^{n} f(X_r)\right) \rightarrow \exp (-\lambda ^2 \sigma ^2/2), \quad n\rightarrow \infty . $$

In the calculus below there will be expectations with respect to the measure \(\mathbb P\) (denoted by \(\mathbb E\)) as well as expectations \(\mathbb E_{inv}\). Note that they are different: the latter means the expectation of a function of a random variable \(X_k\) computed with respect to the invariant measure of the process.

We are going to use Bernstein’s method of ‘corridors and windows’ (cf. [1, 2]). Let us split the interval [0, n] into subintervals of two types: longer ones called ‘corridors’ and shorter ones called ‘windows’; their sizes will grow with n as follows. Let \(\displaystyle k: = [n/[n^{3/4}]]\) be the total number of long corridors of equal length (here [a] is the integer part of \(a\in \mathbb R\)); this length will be chosen shortly to be equivalent to \(n^{3/4}\). The length of each window is \(w:= [n^{1/5}]\). Now, the length of each corridor except the last one is \(c:= [n/k] - w \equiv [n/k] - [n^{1/5}]\); the last, complementary corridor has length \(c_{L} := n - k[n/k]\). Note that \(c_{L} < k \sim n^{1/4}\) (i.e. \(k/n^{1/4} \rightarrow 1, \, n\rightarrow \infty \)), and \(c \sim n^{3/4}\).

The total length of all windows is then equivalent to \(w \times k \sim n^{1/5 + 1/4} = n^{9/20}\); note for the sequel that \(n^{9/20}\ll n^{1/2}\). As was mentioned above, the length of the last corridor does not exceed k and hence, asymptotically, is no more than \(n^{1/4}\), which is much less than the length of any other corridor.
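This bookkeeping can be verified mechanically: by construction, k corridors of length c, k windows of length w, and the last corridor of length \(c_L\) tile [0, n] exactly, and \(c_L < k\). A Python sketch (the function name is ours; the specific n is an arbitrary illustration):

```python
def blocks(n):
    """Corridor/window bookkeeping from the text, for a given n:
    k  = [n / [n^(3/4)]]   number of long corridors,
    w  = [n^(1/5)]         window length,
    c  = [n/k] - w         corridor length,
    cL = n - k [n/k]       length of the last, complementary corridor."""
    k = n // int(n ** 0.75)
    w = int(n ** 0.2)
    c = n // k - w
    cL = n - k * (n // k)
    return k, w, c, cL

n = 123_456_789
k, w, c, cL = blocks(n)
# k corridors, k windows and the last corridor tile [0, n) exactly,
# and cL = n mod k < k, so the last corridor is at most ~ n^{1/4}.
print(k * (c + w) + cL == n, cL < k)   # True True
```

The exact tiling identity \(k(c+w)+c_L = n\) is what licenses replacing the full sum by the sum over corridors below: the discarded windows and last corridor have total length \(o(\sqrt{n})\).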

Now, denote all partial sums \(\sum \limits _{r=0}^{n} f(X_r)\) over the first k corridors by \(\eta _j, \, 1\le j\le k\). In particular,

$$ \eta _1 = \sum ^{c-1}_{r=0} f(X_r), \quad \eta _2 = \sum ^{2c+w-1}_{r=c+w} f(X_r), \;\; \text{ etc. } $$

Note that

$$ \frac{1}{\sqrt{n}}|\sum _{r=0}^{n}f(X_r) - \sum _{j=1}^k \eta _j| \le C_f \frac{(wk + k)}{\sqrt{n}} \sim C_f \frac{n^{9/20}+ n^{1/4}}{\sqrt{n}} \rightarrow 0, \quad n\rightarrow \infty , $$

uniformly in \(\omega \in \Omega \). Hence, it suffices to show that

$$ \frac{1}{\sqrt{n}}\sum _{j=1}^k \eta _j \Longrightarrow \eta ' \sim \mathcal{N}(0, \sigma ^2). $$

Note that

$$\begin{aligned} n^{-1}{\mathbb E}_{inv} \eta _1^2 \sim \frac{c}{n} \sigma ^2, \quad n\rightarrow \infty , \end{aligned}$$

or,

$$\begin{aligned} n^{-3/4}{\mathbb E}_{inv} \eta _1^2 \rightarrow \sigma ^2, \quad n\rightarrow \infty , \end{aligned}$$
(17)

and the latter convergence is irrespectively of whether \(\sigma ^2>0\), or \(\sigma ^2=0\).

Now, to show the desired weak convergence, let us check the behaviour of the characteristic functions. Due to the Corollary 9, we estimate for any \(\lambda \in \mathbb R\),

$$ |{\mathbb E} (\exp (i\lambda \eta _j) | \mathcal{F}^{X}_{(j-1)[n/k]}) - {\mathbb E}_{inv} \exp (i\lambda \eta _j)| \le C (1-\kappa )^{[n^{1/5}]}. $$

So, by induction,

$$\begin{aligned}&\displaystyle {\mathbb E} \exp \left( i\frac{\lambda }{\sqrt{n}} \sum _{j=1}^{k+1}\eta _j\right) = {\mathbb E}\left( \exp \left( i\frac{\lambda }{\sqrt{n}} \sum _{j=1}^{k}\eta _j\right) {\mathbb E} \left( \exp \left( i\frac{\lambda }{\sqrt{n}} \eta _{k+1}\right) \Big | \, \mathcal{F}^{X}_{k[n/k]}\right) \right) \nonumber \\ \nonumber \\&\displaystyle = {\mathbb E} \exp \left( i\frac{\lambda }{\sqrt{n}} \sum _{j=1}^{k}\eta _j\right) \left( {\mathbb E}_{inv} \exp \left( i\frac{\lambda }{\sqrt{n}} \eta _{k+1}\right) + O((1-\kappa )^{n^{1/5}}) \right) = \ldots \nonumber \\ \nonumber \\&\displaystyle = \left( {\mathbb E}_{inv} \exp \left( i\frac{\lambda }{\sqrt{n}} \eta _{k+1}\right) + O((1-\kappa )^{n^{1/5}})\right) \left( \left( {\mathbb E}_{inv} \exp \left( i\frac{\lambda }{\sqrt{n}} \eta _1\right) \right) ^k + O(k(1-\kappa )^{n^{1/5}})\right) . \nonumber \\ \end{aligned}$$
(18)

Here \(O(k(1-\kappa )^{n^{1/5}})\) is, generally speaking, random and it is a function of \(X_{k[n/k]}\), but the modulus of this random variable does not exceed a nonrandom constant multiplied by \(k(1-\kappa )^{n^{1/5}}\). We replaced \([n^{1/5}]\) by \(n^{1/5}\), which does not change the asymptotic (in)equality. Note that

$$ O(k(1-\kappa )^{n^{1/5}})) = O(n^{3/4}(1-\kappa )^{n^{1/5}})) \rightarrow 0, \quad n\rightarrow \infty . $$

Now the idea is to use Taylor’s expansion

$$\begin{aligned} {\mathbb E}_{inv} \exp \left( i\frac{\lambda }{\sqrt{n}} \eta _1\right) = 1 - \frac{\lambda ^2}{2n}n^{3/4} \sigma ^2 + R_n = 1 - \frac{\lambda ^2}{2n^{1/4}} \sigma ^2 + R_n. \end{aligned}$$
(19)

Here, to prove the desired statement it suffices to estimate accurately the remainder term \(R_n\), that is, to show that \(R_n = o(n^{-1/4}), \, n\rightarrow \infty \).

Since we have, actually, transferred the problem to an array scheme (as \(\eta _1\) itself changes with n), we have to inspect this remainder \(R_n\) carefully. Denote by \(\varphi \) the characteristic function of \(\eta _1\) under the invariant measure, \(\varphi (t) := {\mathbb E}_{inv} \exp (it \eta _1)\). Due to the Taylor expansion we have,

$$\begin{aligned} \text{ Re }\,\varphi \left( \frac{\lambda }{\sqrt{n}}\right) = {\mathbb E}_{inv} \cos \left( \frac{\lambda \eta _1}{\sqrt{n}}\right) = 1 -\frac{\lambda ^2}{2 n}\mathbb E_{inv}\eta _1^2 + \frac{\hat{\lambda }^3}{6\sqrt{n^3}} {\mathbb E}_{inv}\eta _1^3\sin (\hat{\lambda }\eta _1), \end{aligned}$$

with some \(\hat{\lambda }\) between 0 and \(\lambda \), and similarly, with some \(\tilde{\lambda }\) between 0 and \(\lambda \),

$$\begin{aligned} \text{ Im }\,\varphi \left( \frac{\lambda }{\sqrt{n}}\right) = {\mathbb E}_{inv} \sin \left( \frac{\lambda \eta _1}{\sqrt{n}}\right) = - \frac{\tilde{\lambda }^3}{6\sqrt{n^3}} {\mathbb E}_{inv} \eta _1^3\cos (\tilde{\lambda }\eta _1). \end{aligned}$$

Here in general \(\hat{\lambda }\) and \(\tilde{\lambda }\) may differ. However, this is not important in our calculus because in any case \(|\tilde{\lambda }| \le |\lambda |\) and \(|\hat{\lambda }| \le |\lambda |\). All we need to do now is to justify a bound

$$\begin{aligned} |{\mathbb E}_{inv}\eta _1^3| \le K c, \end{aligned}$$
(20)

with some non-random constant K. This is a rather standard estimate, and we show the details only for completeness (see similar computations in [14, 18, 22], et al.). It suffices to consider the case \(C_f \le 1\), a restriction we assume without loss of generality.

(a) Consider the case \(\mathbb E f(X_k)^3\). We have, clearly,

$$ |\sum _{k=1}^{c} \mathbb E f(X_k)^3 | \le c. $$

(b) For simplicity, denote \(f_k = f(X_k)\) and consider the case \(\mathbb E f_j f_k f_\ell , \, \ell> k>j\).

We have,

$$\begin{aligned}&\displaystyle \sum _{j=0}^{c-2} \sum _{k=j+1}^{c-1} \sum _{\ell =k+1}^{c}\mathbb E f_j f_k f_\ell = \sum _{j=0}^{c-2} \sum _{k=j+1}^{c-1} \sum _{\ell =k+1}^{c}\mathbb E f_j f_k \mathbb E(f_\ell | X_k) \\\\&\displaystyle = \sum _{j=0}^{c-2} \sum _{k=j+1}^{c-1} \sum _{\ell =k+1}^{c}\mathbb E f_j f_k \psi _{k,\ell } q^{\ell -k} \quad \text{(here } \psi _{k,\ell } \in \mathcal{F}^X_{(k)}\equiv \sigma (X_k) \text{ and } |\psi _{k,\ell }|\le 1\text{) }. \end{aligned}$$

Note that, with a \(0\le q < 1\), the expression

$$ \zeta _{k}:=\sum _{\ell = k+1}^{c} \psi _{k,\ell } q^{\ell -k} $$

is a random variable whose modulus is bounded by the absolute constant \((1-q)^{-1}\) and which is \(\mathcal{F}^X_{(k)}\)-measurable, i.e. it may be represented as some Borel function of \(X_k\). So, we continue the calculation,

$$\begin{aligned}&\sum _{j=0}^{c-2} \sum _{k=j+1}^{c-1} \sum _{\ell =k+1}^{c}\mathbb E f_j f_k f_\ell = \sum _{j=0}^{c-2} \sum _{k=j+1}^{c-1} \mathbb E f_j f_k \zeta _{k} \\&= \sum _{j=0}^{c-2} \sum _{k=j+1}^{c-1} \mathbb E \left( f_j \, \mathbb E(f_k \zeta _{k} \, | \, X_j)\right) = \sum _{j=0}^{c-2} \sum _{k=j+1}^{c-1} \mathbb E \left( f_j \left( \mathbb E f_k \zeta _{k} + \zeta '_{k,j}q^{k-j}\right) \right) \\&= \sum _{j=0}^{c-2} \mathbb E \left( f_j \sum _{k=j+1}^{c-1} \zeta '_{k,j}q^{k-j}\right) , \end{aligned}$$

due to \(\mathbb E f_j \mathbb E f_k \zeta _{k} = 0\), since \(\mathbb E f_j=0\). Here \(\zeta '_{k,j}\), in turn, for each k does not exceed \((1-q)^{-1}\) in modulus and is \(\mathcal{F}^X_{(j)}\)-measurable. Therefore, the inner sum in the last expression satisfies,

$$ |\sum _{k=j+1}^{c-1} \zeta '_{k,j}q^{k-j}| \le \sum _{k=j+1}^{c-1} |\zeta '_{k,j}| q^{k-j} \le (1-q)^{-1} \sum _{k=j+1}^{c-1} q^{k-j} \le (1-q)^{-2}. $$

Thus,

$$ |\sum _{j=0}^{c-2} \sum _{k=j+1}^{c-1} \sum _{\ell =k+1}^{c}\mathbb E f_j f_k f_\ell | \le (1-q)^{-2} \sum _{j=0}^{c-2} \mathbb E |f_j| \le c (1-q)^{-2}, $$

as required.

(c) Consider the terms with \(\mathbb E f(X_k)^2 f(X_\ell ), \, \ell > k\). We estimate, with some (random) \(|\psi '_{\ell ,k}|\le 1\) and \(0\le q<1\),

$$\begin{aligned} |\sum _{0\le k<\ell <c} \mathbb E f(X_k)^2 \mathbb E(f(X_\ell ) | X_k)| = |\sum _{0\le k<\ell < c}\mathbb E f(X_k)^2 \psi '_{\ell , k} q^{\ell -k}| \le \frac{c}{1-q}. \end{aligned}$$

(d) Consider the case \(\mathbb E f(X_k)^2 f(X_\ell ), \, \ell < k\). We have similarly, for \(\ell < k\),

$$ \mathbb E f^2_k f_\ell = \mathbb E f_\ell \mathbb E(f_k^2 | X_\ell ) = \mathbb E f_\ell (\mathbb E f_k^2 + \psi _{\ell , k}'' q^{k-\ell }), $$

with some (random) \(|\psi _{\ell , k}''|\le 1\). So, again,

$$\begin{aligned} |\sum _{0\le \ell<k<c} \mathbb E f_\ell \mathbb E(f_k^2 | X_\ell )| = |\sum _{0\le \ell<k<c}\mathbb E f_k^2 \psi ''_{\ell , k} q^{k-\ell }| \le \frac{c}{1-q}. \end{aligned}$$

(e) Finally, collecting all intermediate bounds we obtain the bound (20), as required:

$$ |\mathbb E\eta _1^3| \le K c. $$

This implies the estimate for the remainder term \(R_n\) in (19) of the form

$$ |R_n| \le \frac{c}{n^{3/2}} \sim n^{3/4 - 3/2} = n^{-3/4} = o(n^{-1/4}), $$

as required. The last detail is to consider the term \({\mathbb E}_{inv} \exp (i\frac{\lambda }{\sqrt{n}} \eta _{k+1})\) in (18), for which we have \(\sigma ^2_{k+1}:= {\mathbb E} \eta ^2_{k+1}\) satisfying

$$\begin{aligned} {\mathbb E}_{inv} \eta _{k+1}^2 = O(n^{3/4} \sigma ^2), \quad n\rightarrow \infty . \end{aligned}$$
(21)

This term may be tackled similarly to all others, and, in any case, we get the estimate

$$ {\mathbb E}_{inv} \exp \left( i\frac{\lambda }{\sqrt{n}} \eta _{k+1}\right) = 1 + o(1), \quad n\rightarrow \infty . $$

Hence, we eventually get (recall that \(c \sim n^{3/4}\)),

$$\begin{aligned} \mathbb E\exp \left( i\frac{\lambda }{\sqrt{n}}\sum _{r=0}^{n} f(X_r)\right)&= \left( 1 - \frac{\lambda ^2 \sigma ^2 c}{2n} + O(\frac{c}{n^{3/2}})\right) ^{n^{1/4}}\\&= \left( 1 - \frac{\lambda ^2 \sigma ^2}{2n^{1/4}} + O\left( \frac{1}{n^{3/4}}\right) \right) ^{n^{1/4}} \rightarrow \exp (-\lambda ^2 \sigma ^2/2), \end{aligned}$$

which is the characteristic function for the Gaussian distribution \(\mathcal{N}(0,\sigma ^2)\), as required.

(II) The case \(\sigma ^2 = 0\) is treated in exactly the same way. Namely, with practically identical arguments we get, now with \(\sigma ^2=0\),

$$\begin{aligned} \mathbb E\exp \left( i\frac{\lambda }{\sqrt{n}}\sum _{r=0}^{n} f(X_r)\right)&= \left( 1 - \frac{\lambda ^2 \sigma ^2 c}{2n} + O\left( \frac{c}{n^{3/2}}\right) \right) ^{n^{1/4}}\\&= \left( 1 + O\left( \frac{1}{n^{3/4}}\right) \right) ^{n^{1/4}} \rightarrow 1,\end{aligned}$$

which is the characteristic function for the degenerate Gaussian distribution \(\mathcal{N}(0,0)\), as required. Hence, the Theorem 17 is proved.
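For readers who wish to experiment, the CLT just proved can be checked numerically. The following sketch (assuming Python with numpy; the chain P, the centered function f and all tolerances are our own illustrative choices, not from the text) computes \(\sigma ^2\) by its series representation and compares it with the empirical variance of \(n^{-1/2}\sum _r f(X_r)\) over many simulated trajectories.

```python
import numpy as np

# Illustrative two-state chain (our own choice): pi P = pi with pi = (4/7, 3/7),
# and f is centered under pi.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = np.array([4/7, 3/7])
f = np.array([3.0, -4.0])
assert abs(pi @ f) < 1e-12            # centering check

# sigma^2 = E_inv f(X_0)^2 + 2 sum_{k>=1} E_inv f(X_0) f(X_k)
sigma2 = pi @ (f * f)
Pk_f = f.copy()
for _ in range(200):                  # correlations decay geometrically
    Pk_f = P @ Pk_f
    sigma2 += 2 * pi @ (f * Pk_f)

# Monte Carlo: m trajectories of length n started from the invariant law
rng = np.random.default_rng(0)
m, n = 4000, 2000
state = rng.choice(2, size=m, p=pi)
sums = np.zeros(m)
for _ in range(n):
    sums += f[state]
    state = (rng.random(m) > P[state, 0]).astype(int)
emp_var = np.var(sums / np.sqrt(n))
print(sigma2, emp_var)                # the two variances should be close
```

For this particular chain f happens to be an eigenvector of \(\mathcal P\) with eigenvalue 0.3, so \(\sigma ^2\) can also be computed in closed form, which makes the example easy to cross-check.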

5 Coupling Method for Markov Chain: Simple Version

Concerning the coupling method, it is difficult to say who exactly invented it. The common view—shared by the author of these lecture notes—is that it was introduced by W. Doeblin [9], even though he himself refers to some ideas of Kolmogorov in relation to the study of ergodic properties of Markov chains. Leaving this subject to the historians of mathematics, let us just mention that there are quite a few articles and monographs where this method is presented [16, 29, 33, 40], et al. Also there are many papers and books where this or a closely related method is used for further investigations without being explicitly named, see, e.g. [4]. This method itself provides ‘another way’ to establish geometric convergence in the Ergodic theorem. In the simple form presented in this section, this method has limited applications; however, in a more elaborated version—see the Sect. 13 below—it is most useful, and applicable to a large variety of Markov processes including rather general diffusions, providing not necessarily geometric rates of convergence but also much weaker rates in non-compact spaces.

By simple coupling for two random variables \(X^1, X^2\) we understand the situation where both \(X^1, X^2\) are defined on the same probability space and

$$ \mathbb P(X^1 = X^2)>0. $$

Consider a Markov chain \((X_n, \, n=0, 1, \ldots )\). In fact, this simple ‘Doeblin’s’ version of coupling provides bounds of convergence which are far from optimal in most cases. (By ‘far from optimal’ we understand that the constant under the exponential is too rough.) Yet, its advantage is its simplicity and, in particular, no change of the initial probability space. From the beginning we need two ‘independent’ probability spaces, \((\Omega ^1, \mathcal{F}^1, \mathbb P^1)\), and \((\Omega ^2, \mathcal{F}^2, \mathbb P^2)\), and the whole construction runs on the direct product of those two:

$$ (\Omega , \mathcal{F}, \mathbb P):= (\Omega ^1, \mathcal{F}^1, \mathbb P^1) \times (\Omega ^2, \mathcal{F}^2, \mathbb P^2). $$

This space \((\Omega , \mathcal{F}, \mathbb P)\) will remain unchanged in this section. We assume that there are two Markov processes \((X^1_n)\) on \((\Omega ^1, \mathcal{F}^1, \mathbb P^1)\) and \((X^2_n)\) on \((\Omega ^2, \mathcal{F}^2, \mathbb P^2)\), correspondingly, with the same transition probability matrix \(\mathcal{P}=(p_{ij})_{i,j\in S}\) satisfying the ‘simple ergodic assumption’,

$$\begin{aligned} \kappa _0:= \min _{i,j}p_{ij} > 0. \end{aligned}$$
(22)

Naturally, both processes are defined on \((\Omega , \mathcal{F}, \mathbb P)\) as follows,

$$ X^1_n(\omega ) = X^1_n(\omega ^1, \omega ^2) := X^1_n(\omega ^1), \quad \& \quad X^2_n(\omega ) = X^2_n(\omega ^1, \omega ^2) := X^2_n(\omega ^2). $$

We will need some (well-known) auxiliary results. Recall that, given a filtration \((\mathcal{F}_n)\), a stopping time is any random variable \(\tau < \infty \) a.s. with values in \({\mathbb Z}_+\) such that for any \(n\in {\mathbb Z}_+\),

$$ (\omega : \, \tau > n) \in \mathcal{F}_n. $$

In most textbooks on Markov chains the following Lemma may be found (see, e.g. [49]).

Lemma 19

Any Markov chain (i.e. a Markov process with discrete time) is strong Markov.

Consider a new process composed of two, \(X_ n := (X^1_n, X^2_n)\), evidently, with two independent coordinates. Due to this independence, the following Lemma holds true.

Lemma 20

The (vector-valued) process \((X_n)\) is a (homogeneous) Markov chain; hence, this chain is also strong Markov.

In the following main result of this section, \(\mu \) stands for the (unique) stationary distribution of our Markov chain \((X^1_n)\) (as well as of \((X^2_n)\)).

Theorem 21

For any initial distribution \(\mu _0\),

$$\begin{aligned} \sup _{A} |P_{\mu _0}(n,A) - \mu (A)| \le (1-\kappa _0)^n. \end{aligned}$$
(23)

Let us emphasize again that the bound may be not optimal; however, the advantage is that the construction of coupling here does not require any change of the probability space.

Proof of Theorem 21. Recall that a new Markov chain \(X_ n := (X^1_n, X^2_n)\) with two independent coordinates is strong Markov. Let

$$ \tau := \inf (n\ge 0: \; X^1_n = X^2_n). $$

We have seen that \(\mathbb P(\tau <\infty )=1\). Moreover, from the Markov property it follows by induction on n (here \(1(A)\) denotes the indicator of an event A: \(1(A)(\omega ) = 1\) if \(\omega \in A\) and \(1(A)(\omega ) = 0\) if \(\omega \not \in A\)),

$$\begin{aligned}&\displaystyle {\mathbb P}(\tau>n) = \mathbb E1(\tau>n) = \mathbb E\prod _{k=1}^{n}1(\tau>k) \nonumber \\&\displaystyle = \mathbb E \left( \mathbb E(\prod _{k=1}^{n}1(\tau>k) | \mathcal{F}_{n-1})\right) = \mathbb E \left( \prod _{k=1}^{n-1}1(\tau>k) \mathbb E(1(\tau>n) | \mathcal{F}_{n-1})\right) \nonumber \\&\displaystyle \le \mathbb E\prod _{k=1}^{n-1}1(\tau>k) (1-\kappa _0) = (1-\kappa _0) \mathbb E\prod _{k=1}^{n-1}1(\tau >k) \le \; {\mathop {\ldots }\limits ^{\text{(induction) }}} \; \le (1-\kappa _0)^n. \nonumber \\ \end{aligned}$$
(24)

Define

$$ X^3_n := X^1_n 1(n < \tau ) + X^2_n 1(n\ge \tau ). $$

Due to the strong Markov property, \((X^3)\) is also a Markov chain, and it is equivalent to \((X^1)\)—that is, they both have the same distribution in the space of trajectories. Indeed, \(\tau \) is a stopping time, and after \(\tau \) the process \(X^3\) follows the trajectory of \(X^2\), which uses the same transition probabilities when choosing the state at \(\tau +1\) and afterwards.

Now, here is the standard calculation appearing in most works on the coupling method, or where this method is used (recall that all three processes \(X^1, \, X^2, \, X^3\) are defined on the same probability space): for any \(A\in \mathcal{S}\),

$$\begin{aligned}&\displaystyle |{\mathbb P}(X^1_n\in A) - {\mathbb P}(X^2_n\in A)| = |{\mathbb P}(X^3_n\in A) - {\mathbb P}(X^2_n\in A)| = |\mathbb E(1(X^3_n \in A) - 1(X^2_n\in A))| \\\\&\displaystyle = |\mathbb E(1(X^3_n\in A) - 1(X^2_n\in A)) 1(\tau>n) + \mathbb E(1(X^3_n\in A) - 1(X^2_n\in A)) 1(\tau \le n)| \\\\&\displaystyle {\mathop {=}\limits ^{(*)}} |\mathbb E(1(X^3_n\in A) - 1(X^2_n\in A)) 1(\tau>n)| \le \mathbb E|1(X^3_n\in A) - 1(X^2_n\in A)| 1(\tau>n) \\\\&\displaystyle \le \mathbb E 1(\tau>n) = {\mathbb P}(\tau >n) {\mathop {\le }\limits ^{(24)}} (1-\kappa _0)^n. \end{aligned}$$

Note that the final bound is uniform in A. Here the equality (*) is due to the trivial fact that since \(n\ge \tau \), the values of \(X^3_n\) and \(X^2_n\) coincide, so either \(1(X^3_n\in A) = 1(X^2_n\in A) = 0\), or \(1(X^3_n\in A) = 1(X^2_n\in A) = 1\) simultaneously on each \(\omega \), which immediately implies that \((1(X^3_n\in A) - 1(X^2_n\in A)) 1(\tau \le n) = 0\). So, the Theorem 21 is proved.
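The bound of Theorem 21 is easy to test by simulation. The following sketch (assuming Python with numpy; the transition matrix and all sample sizes are our own illustrative choices) runs many independent pairs \((X^1, X^2)\) started from different states, records the first meeting time \(\tau \), and compares the empirical tail \(\mathbb P(\tau > n)\) with \((1-\kappa _0)^n\).

```python
import numpy as np

# Doeblin coupling on a two-state chain (our own example).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
kappa0 = P.min()                      # = 0.3, so the bound is 0.7^n

rng = np.random.default_rng(1)
m = 20000                             # independent pairs (X^1, X^2)
x1 = np.zeros(m, dtype=int)           # X^1 starts at state 0
x2 = np.ones(m, dtype=int)            # X^2 starts at state 1
tau = np.full(m, -1)                  # first meeting time of each pair
for n_step in range(60):
    met = (tau < 0) & (x1 == x2)      # tau = inf(n >= 0: X^1_n = X^2_n)
    tau[met] = n_step
    x1 = (rng.random(m) > P[x1, 0]).astype(int)
    x2 = (rng.random(m) > P[x2, 0]).astype(int)
tau[tau < 0] = 60                     # (practically never happens here)

for n in (1, 3, 5, 10):
    print(n, (tau > n).mean(), (1 - kappa0) ** n)
```

As the text warns, the Doeblin bound is far from optimal: for this chain the two independent copies actually meet at a per-step rate noticeably better than \(\kappa _0\), so the empirical tail sits well below \((1-\kappa _0)^n\).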

6 A Bit of Large Deviations

In this section, assume

$$ {\mathbb E}_{inv} f(X_0) = 0. $$

We will be interested in the existence and properties of the limit,

$$\begin{aligned} \lim _{n\rightarrow \infty } \frac{1}{n} \ln {\mathbb E}_{x} \exp \left( \beta \sum _{k=0}^{n-1} f(X_k)\right) =: H(\beta ). \end{aligned}$$
(25)

Note that we do not use x in the right hand side because in ‘good cases’—as below—the limit does not depend on the initial state. Denote

$$ H_n(\beta ,x) := \frac{1}{n} \ln {\mathbb E}_{x} \exp \left( \beta \sum _{k=0}^{n-1} f(X_k)\right) , $$

and define the operator \(T = T^\beta \) acting on functions on S as follows,

$$ T^\beta h(x) := \exp (\beta f(x)) {\mathbb E}_x h(X_1), $$

for any function h defined on S. Note that

$$ {\mathbb E}_{x} \exp \left( \beta \sum _{k=0}^{n-1} f(X_k)\right) = (T^\beta )^n h(x), $$

with \(h(x) \equiv 1\). Indeed, for \(n=1\) this coincides with the definition of \(T^\beta \). Further, for \(n >1\) due to the Markov property by induction,

$$\begin{aligned}&\displaystyle {\mathbb E}_{x} \exp \left( \beta \sum _{k=0}^{n-1} f(X_k)\right) = {\mathbb E}_{x} {\mathbb E}_{x} \left( \exp \left( \beta \sum _{k=0}^{n-1} f(X_k)\right) | X_{1}\right) \\\\&\displaystyle = \exp (\beta f(x)) {\mathbb E}_{x} {\mathbb E}_{x}\left( \exp \left( \beta \sum _{k=1}^{n-1} f(X_k)\right) | X_{1}\right) \\\\&\displaystyle = \exp (\beta f(x)) {\mathbb E}_{x} (T^\beta )^{n-1} h(X_1) = T^\beta (T^\beta )^{n-1} h(x) = (T^\beta )^{n} h(x), \end{aligned}$$

as required. So, the function \(H_n\) can be rewritten as

$$ H_n(\beta ,x) = \frac{1}{n} \ln (T^\beta )^nh(x), $$

(\(h(x) \equiv 1\)). It is an easy exercise to check that the function \(H_n(\beta ,x)\) is convex in \(\beta \). Hence, if the limit exists, then the limiting function H is also convex. Now recall the following classical and basic result about positive matrices.

Theorem 22

(Perron–Frobenius) Any positive square matrix (i.e. with all entries positive) has a positive eigenvalue r called its spectral radius, which is strictly greater than the moduli of the rest of the spectrum; this eigenvalue is simple, and its corresponding eigenfunction (eigenvector) has all positive coordinates.

In fact, this result under the specified conditions is due to Perron, while Frobenius extended it to non-negative matrices. We do not discuss the details of this difference and how it can be used. Various presentations may be found, in particular, in [20, 25, 38]. As an easy corollary, the Theorem 22 implies the existence of the limit in (25)—which, as was promised, does not depend on x—with,

$$\begin{aligned} H(\beta ) = \ln r(\beta ), \end{aligned}$$
(26)

where \(r(\beta )\) is the spectral radius of the operator \(T^\beta \), see, for example, [14, Theorem 7.4.2]. (We emphasize that in the proof of this theorem it is important that the eigenvector corresponding to the spectral radius is strictly positive, i.e. it has all positive components.) Moreover, in our case it follows from the theorem on analytic properties of simple eigenvalues that \(r(\beta )\) is analytic, see, e.g. [21]. Therefore, \(H(\beta )\) is analytic, too. Clearly, \(H_n\) is also analytic as a function of the variable \(\beta \). Then it follows from the properties of analytic (or convex) functions that the convergence \(H_n(\beta ,x) \rightarrow H(\beta )\) implies that also

$$ H'_n(0,x) \rightarrow H'(0), \quad n\rightarrow \infty , $$

where by \(H'_n\) we understand the derivative with respect to \(\beta \). On the other hand, we have,

$$\begin{aligned}&\displaystyle H_n'(0,x) = \frac{\partial }{\partial \beta } \left( \frac{1}{n} \ln {\mathbb E}_{x} \exp (\beta \sum _{k=0}^{n-1} f(X_k))\right) |_{\beta = 0} \\\\&\displaystyle = \frac{1}{n} \, \frac{{\mathbb E}_{x} \left( \sum \limits _{k=0}^{n-1} f(X_k) \exp (\beta \sum \limits _{k=0}^{n-1} f(X_k))\right) }{{\mathbb E}_{x} \exp (\beta \sum _{k=0}^{n-1} f(X_k))}|_{\beta = 0} = \frac{1}{n} {\mathbb E}_{x} \sum \limits _{k=0}^{n-1} f(X_k). \end{aligned}$$

So, due to the Law of Large Numbers,

$$ H_n'(0,x) =\frac{1}{n} {\mathbb E}_{x} \sum _{k=0}^{n-1} f(X_k) \rightarrow {\mathbb E}_{inv} f(X_0) = 0. $$

Hence,

$$ H'(0) = {\mathbb E}_{inv} f(X_0) = 0. $$

Also, again due to the analyticity,

$$ H''_n(0,x) \rightarrow H''(0), \quad n\rightarrow \infty . $$

On the other hand, due to (17),

$$ H''_n(0,x) =\frac{1}{n} {\mathbb E}_{x} \left( \sum _{k=0}^{n-1} f(X_k)\right) ^2 \rightarrow \sigma ^2, \quad n\rightarrow \infty . $$

Hence,

$$ H''(0) = \sigma ^2. $$

This last assertion will not be used in the sequel.

Let us state it all as a lemma.

Lemma 23

There exists a limit \(H(\beta )\) in (25). This function H is convex and differentiable, and, in particular,

$$ H'(0) = 0, \quad H''(0) = \sigma ^2. $$
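On a finite state space the identity \(H(\beta ) = \ln r(\beta )\) makes Lemma 23 directly computable: \(T^\beta \) is just the matrix with entries \(e^{\beta f(x)} p_{xy}\). The sketch below (assuming Python with numpy; the chain, f, and the step size h are our own illustrative choices) computes \(H\) via the Perron root and checks \(H(0)=0\), \(H'(0)=0\) and \(H''(0)=\sigma ^2\) by finite differences.

```python
import numpy as np

# Our illustrative two-state chain; f is centered under the invariant law pi.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = np.array([4/7, 3/7])
f = np.array([3.0, -4.0])

def H(beta):
    # T^beta h(x) = exp(beta f(x)) E_x h(X_1): matrix exp(beta f(x)) p_{xy}
    T = np.exp(beta * f)[:, None] * P
    r = max(abs(np.linalg.eigvals(T)))    # Perron root = spectral radius
    return np.log(r)

h = 1e-3
H1 = (H(h) - H(-h)) / (2 * h)             # ~ H'(0) = E_inv f(X_0) = 0
H2 = (H(h) - 2 * H(0.0) + H(-h)) / h**2   # ~ H''(0) = sigma^2
print(H(0.0), H1, H2)
```

For this chain the asymptotic variance works out to \(\sigma ^2 = 156/7 \approx 22.29\) (f is an eigenvector of \(\mathcal P\) with eigenvalue 0.3), and the second difference reproduces it.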

Actually, we will not use large deviations (LDs) in these lecture notes, except for the Lemma 23, which is often regarded as a preliminary auxiliary result in large deviation theory. Yet, since the title of the section uses this term, let us state one simple inequality of LD type. Recall that \({\mathbb E}_{inv}f(X_0) = 0\) in this section.

Proposition 24

Let

$$\begin{aligned} L(\alpha ) := \sup _{\beta }(\alpha \beta - H(\beta )), \quad \tilde{L}(\alpha ):= \limsup _{\delta \rightarrow 0}L(\alpha + \delta ). \end{aligned}$$
(27)

Then under the assumptions of the Ergodic Theorem 8 for any \(\epsilon > 0\),

$$\begin{aligned} \limsup _{n\rightarrow \infty } \frac{1}{n} \ln \mathbb P_x\left( \frac{1}{n} \sum _{k=0}^{n-1} f(X_k) \ge \epsilon \right) \le - \tilde{L}(\epsilon ). \end{aligned}$$
(28)

The function L is called the Legendre transform of the function H. It is convex and lower semicontinuous; see [37] about this and more general transformations (e.g. where H is convex but not necessarily smooth, in which case L is called the Fenchel–Legendre transform). Notice that ‘usually’ in (28) there is a limit instead of \(\limsup \), this limit equals the right hand side, and \(\tilde{L}(\epsilon ) = L(\epsilon )>0\); the latter is certainly true, at least, for small \(\epsilon >0\), if \(\sigma ^2 >0\). However, this simple result does not pretend to be even an introduction to large deviations; on this theory see [5, 7, 13, 14, 17, 36, 42], et al. In the next sections the Proposition 24 will not be used: all we will need is the limit in (25) due to the Lemma 23 and some of its properties, which will be specified.

Proof of Proposition 24. We have for any \(0<\delta < \epsilon \), by Chebyshev–Markov’s exponential inequality with any \(\lambda >0\),

$$\begin{aligned} \mathbb P_x\left( \frac{1}{n} \sum _{k=0}^{n-1} f(X_k) \ge \epsilon \right) = \mathbb P_x\left( \exp \left( \lambda \sum _{k=0}^{n-1} f(X_k)\right) \ge \exp ( n\lambda \epsilon )\right) \\\\ \le \exp (-n\lambda \epsilon ) \mathbb E_x\exp \left( \lambda \sum _{k=0}^{n-1} f(X_k)\right) {\mathop {\le }\limits ^{(25)}} \exp (-n(\lambda (\epsilon -\delta ) - H(\lambda ))), \end{aligned}$$

if n is large enough. The first and the last terms here with the inequality between them can be rewritten equivalently as

$$ \frac{1}{n} \ln \mathbb P_x\left( \frac{1}{n} \sum _{k=0}^{n-1} f(X_k) \ge \epsilon \right) \le -\lambda (\epsilon -\delta ) +H(\lambda ), $$

for n large enough. So, we have,

$$ \limsup \limits _{n\rightarrow \infty } \frac{1}{n} \ln \mathbb P_x\left( \frac{1}{n} \sum _{k=0}^{n-1} f(X_k) \ge \epsilon \right) \le -(\lambda (\epsilon -\delta ) - H(\lambda )). $$

Since this is true for any \(\lambda >0\), we also get,

$$\begin{aligned} \limsup \limits _{n\rightarrow \infty } \frac{1}{n} \ln \mathbb P_x\left( \frac{1}{n} \sum _{k=0}^{n-1} f(X_k) \ge \epsilon \right) \le -\sup _{\lambda >0} (\lambda (\epsilon -\delta ) -H(\lambda )). \end{aligned}$$

However, since \(H(0)=0\) and \(H'(0) = 0\), and due to the convexity of H, for positive \(\epsilon - \delta \) the supremum over all \(\lambda \in \mathbb R\) is attained at some \(\lambda >0\), i.e.

$$ \sup _{\lambda >0}(\lambda (\epsilon -\delta ) -H(\lambda )) = \sup _{\lambda \in \mathbb R}(\lambda (\epsilon -\delta ) -H(\lambda )) \equiv L(\epsilon - \delta ). $$

Thus, the left hand side in (28) does not exceed the value \(- \limsup \limits _{\delta \downarrow 0}L(\epsilon - \delta ) \le -\tilde{L}(\epsilon )\), as required. The Proposition 24 is proved.
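The Legendre transform in (27) is also straightforward to compute numerically. The sketch below (assuming Python with numpy; the chain, f, and the grid over \(\beta \) are our own illustrative choices) evaluates \(L(\alpha )\) by a grid search and checks the quadratic behaviour \(L(\alpha ) \approx \alpha ^2/(2\sigma ^2)\) for small \(\alpha > 0\), which follows from \(H(\beta ) \approx \sigma ^2\beta ^2/2\).

```python
import numpy as np

# Same illustrative chain as before; sigma^2 = 156/7 for this example.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
f = np.array([3.0, -4.0])            # centered under the invariant law
sigma2 = 156 / 7

def H(beta):
    # H(beta) = ln of the Perron root of the matrix exp(beta f(x)) p_{xy}
    T = np.exp(beta * f)[:, None] * P
    return np.log(max(abs(np.linalg.eigvals(T))))

def L(alpha, betas=np.linspace(-1.0, 1.0, 4001)):
    # Legendre transform (27) by grid search over beta
    return max(alpha * b - H(b) for b in betas)

for eps in (0.05, 0.1, 0.2):
    print(eps, L(eps), eps**2 / (2 * sigma2))
```

In particular \(L(\epsilon ) > 0\) for these small \(\epsilon \), in line with the remark after Proposition 24 that the rate in (28) is strictly positive when \(\sigma ^2 > 0\).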

7 Dynkin’s Formulae

Let L be a generator of some Markov chain on a finite state space S, that is, for any function u on S,

$$\begin{aligned} Lu(x) : = \mathbb E_x u(X_1) - u(x) \equiv \mathcal{P}u(x) - u(x). \end{aligned}$$
(29)

Recall that here \(\mathcal{P}\) is the transition probability matrix of the corresponding Markov chain \((X_n)\), a function u on S is considered as a column-vector, \(\mathcal{P} u\) is this matrix multiplied by this vector, and \(\mathcal{P}u(x)\) is the x-component of the resulting vector. Note that such difference operators are discrete analogues of elliptic differential operators of the second order studied extensively, in particular, in mathematical physics. What makes them the analogues is that both are generators of Markov processes, either in discrete or in continuous time; also, one may argue via limiting procedures approximating continuous-time processes by discrete ones. Yet the level of this comparison here is, of course, intuitive, and we will not try to justify it in any way, or to explain it further.

As usual in these lecture notes, we will assume that the corresponding process \((X_n)\) satisfies the Ergodic Theorem 8. The Poisson equation for the operator L from (29) is as follows:

$$\begin{aligned} Lu(x) = -f(x), \quad x\in S. \end{aligned}$$
(30)

This equation may be studied with or without some boundary and certain boundary conditions. The goal of this chapter is to present how such equations may be solved probabilistically. This simple study may be also considered as an introduction to the Poisson equations for elliptic differential operators. We start with Dynkin’s formula or Dynkin’s identity.

Theorem 25

(Dynkin’s formula 1) On the finite state space S, for any function h and any \(n \ge 0\),

$$\begin{aligned} \mathbb E_x h(X_n) = h(x) + \sum _{k=0}^{n-1} \mathbb E_x Lh(X_k), \quad n\ge 0. \end{aligned}$$
(31)

Proof

1. For \(n=1\) the formula (31) reads,

$$ \mathbb E_x h(X_1) = h(x) + Lh(x), $$

where x is a non-random initial value of the process. Hence, by inspection, the desired identity for \(n=1\) is equivalent to the definition of the generator in (29).

2. For the general case the desired formula follows by induction. Indeed, assume that the formula (31) holds true for some n and check it for \(n+1\). We have,

$$\begin{aligned}&\displaystyle \mathbb E_x h(X_{n+1}) = \mathbb E_x h(X_{n+1}) - \mathbb E_x h(X_{n}) + \mathbb E_x h(X_{n}) \\\\&\displaystyle = \mathbb E_x \mathbb E_x (h(X_{n+1}) - h(X_{n}) | X_n)+ \mathbb E_x h(X_{n}) \\\\&\displaystyle = \mathbb E_x Lh(X_n) + h(x) + \sum _{k=0}^{n-1} \mathbb E_x Lh(X_k) = h(x) + \sum _{k=0}^{n} \mathbb E_x Lh(X_k). \end{aligned}$$

So, the formula (31) for all values of n follows by induction. The Theorem 25 is proved.
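On a finite state space, Dynkin's formula (31) is a telescoping identity of matrix powers and can be verified directly. The following sketch (assuming Python with numpy; the chain and the function h are our own arbitrary choices) checks it by linear algebra: \(\mathbb E_x h(X_n)\) is the x-th entry of \(\mathcal{P}^n h\), and \(\mathbb E_x Lh(X_k)\) is the x-th entry of \(\mathcal{P}^k(\mathcal{P} h - h)\).

```python
import numpy as np

# Our illustrative two-state chain and an arbitrary function h on S = {0, 1}.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
h = np.array([1.0, 5.0])
Lh = P @ h - h                       # generator from (29), applied to h

n = 7
lhs = np.linalg.matrix_power(P, n) @ h                              # E_x h(X_n)
rhs = h + sum(np.linalg.matrix_power(P, k) @ Lh for k in range(n))  # (31)
print(lhs, rhs)                      # the two vectors coincide
```

The agreement is exact (up to rounding), since both sides reduce to the telescoping sum \(\sum _{k<n}(\mathcal{P}^{k+1}h - \mathcal{P}^k h) = \mathcal{P}^n h - h\).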

8 Stopping Times and Martingales: Reminder

Definition 26

Filtration \((\mathcal{F}_n, \, n=0,1,\ldots )\) is an increasing family of sigma-fields on a probability space \((\Omega , \mathcal{F}, \mathbb P)\), completed with respect to the measure \(\mathbb P\) (that is, each \(\mathcal{F}_n\) contains every subset of any \(\mathbb P\)-null set from \(\mathcal{F}\)). The process \((M_n)\) is called a martingale with respect to a filtration \((\mathcal{F}_n)\) iff \(\mathbb E |M_n| < \infty \) for each n and \(\mathbb E(M_{n+1} |\mathcal{F}_n) = M_n\) (a.s.).

Definition 27

A random variable \(\tau < \infty \) a.s. with values in \({\mathbb Z}_+\) is called a stopping time with respect to a filtration \((\mathcal{F}_n)\) iff for each \(n\in {\mathbb Z}_+\) the event \((\tau >n)\) is measurable with respect to \(\mathcal{F}_n\).

It is recommended to read about simple properties of martingales and stopping times in one of the textbooks on stochastic processes, e.g. [27]. We will only need the following classical result about stopped martingales given here without proof.

Theorem 28

(Doob) Let \((M_n)\) be a martingale and let \(\tau \) be a stopping time with respect to a filtration \((\mathcal{F}_n)\). Then \((\tilde{M}_n:= M_{n\wedge \tau })\) is also a martingale.

In terms of martingales, the first Dynkin’s formula may be re-phrased as follows.

Theorem 29

(Dynkin’s formula 2) On the finite state space S, for any function h, the process

$$\begin{aligned} M_n:= h(X_n) - h(x) - \sum _{k=0}^{n-1} Lh(X_k), \quad n\ge 0, \end{aligned}$$
(32)

is a martingale with respect to the natural filtration \((\mathcal{F}^X_n)\) generated by the process X. Conversely, if the process \(M_n\) from (32) is a martingale, then (31) holds true.

Proof

The inverse statement is trivial. The main part follows due to the Markov property,

$$\begin{aligned} {\mathbb E}(M_n | \mathcal{F}_{n-1}) = {\mathbb E} (h(X_n) | X_{n-1}) - h(x) - \sum _{k=0}^{n-1} Lh(X_k) \\\\ = \mathcal{P}h (X_{n-1}) - Lh(X_{n-1}) - h(x) - \sum _{k=0}^{n-2} Lh(X_k) \\\\ = h(X_{n-1}) - h(x) - \sum _{k=0}^{n-2} Lh(X_k) = M_{n-1}. \end{aligned}$$

The Theorem 29 is thus proved.

Lemma 30

(Dynkin’s formula 3) Let \(\tau \) be a stopping time with

$$ \mathbb E_x\tau < \infty , \quad \forall \; x\in S. $$

Then for any function h on S,

$$ \mathbb E_x h(X_\tau ) = h(x) + \mathbb E_x\sum _{k=0}^{\tau -1} Lh (X_k). $$

Proof

This follows straightforwardly from the Theorem 29.

9 Poisson Equation Without a Potential

9.1 Introduction

Here we consider the following discrete Poisson equation without a potential,

$$\begin{aligned} Lu(x) \equiv \mathcal{P}u(x) -u(x) = -f(x). \end{aligned}$$
(33)

In the next section a similar discrete equation with a potential \(c = c(x), \, x\in S\), will be studied,

$$\begin{aligned} L^c u(x) := \exp (-c(x))\mathcal{P}u(x) -u(x)= -f(x), \end{aligned}$$
(34)

firstly because it is natural for PDEs (here we present an easier but similar discrete-time theory), and secondly in the hope that it may also be useful for some further extensions, as has already happened with equations without a potential. Let \(\mu \) be, as usual, the (unique) invariant probability measure of the process \((X_n, \, n\ge 0)\).

9.2 Poisson Equation (33) with a Boundary

Firstly, we consider Poisson equation with a non-empty boundary,

$$\begin{aligned} L u (x) = -f(x), \;\; x\in S \setminus \Gamma , \quad u(x) = g(x), \; x\in \Gamma , \end{aligned}$$
(35)

where \(\Gamma \subset S\), \(\Gamma \not = \emptyset \). If the right hand side equals zero, this equation is called the Laplace equation with Dirichlet boundary conditions:

$$\begin{aligned} L u (x) = 0, \;\; x\in S \setminus \Gamma , \quad u(x) = g(x), \; x\in \Gamma . \end{aligned}$$
(36)

Let

$$ \tau := \inf (n\ge 0:\; X_n \in \Gamma ), $$

and denote

$$\begin{aligned} v(x) := \mathbb E_x \left( \sum _{k=0}^{\tau -1}f(X_k) + g(X_\tau )\right) . \end{aligned}$$
(37)

Recall that under our assumptions on the process, necessarily \({\mathbb E}_x\tau < \infty \).

For uniqueness, we will need a maximum principle, which holds true for the Laplace equation (recall that we always assume \(\min \limits _{i,j} p_{ij}>0\)):

Lemma 31

(Maximum principle) If the function u satisfies the Eq. (36), then the maximal value (as well as minimal) of this function is necessarily attained at the boundary \(\Gamma \).

Proof

Since \(Lu(x)=0\) for any \(x \not \in \Gamma \), we have

$$\begin{aligned} u(x) = \mathcal{P}u(x), \end{aligned}$$
(38)

for such x. In other words, the value u(x) equals the average of the values u(y) over all \(y\in S\), with positive weights \(p_{xy}\), due to the assumption \(\min \limits _{ij}p_{ij}>0\). However, if a maximal value, say, M, were attained by u not at the boundary, say, \(u(x_0)=M\) with \(x_0 \not \in \Gamma \), while at least one value on \(\Gamma \) (or, actually, anywhere else) were strictly less than M, then we would get a contradiction: the equality \(\sum \limits _{y \in S} p_{x_0 y} u(y) = M\) with all \(u(y)\le M\) and with at least one \(u(y)<M\) is impossible. Similar arguments apply to the minimal value of u. This proves the Lemma 31.

Theorem 32

The function v(x) given by the formula (37) is a unique solution of the Poisson equation (35).

Proof

1. The boundary condition \(v(x) = g(x)\) on \(x \in \Gamma \) is trivial because \(\tau = 0\) in this case.

2. Let \(x\not \in \Gamma \). Then \(\tau \ge 1\). We have, due to the Markov property,

$$\begin{aligned}&\displaystyle v(x) = f(x) + \sum _{y} {\mathbb E}_x 1(X_1=y) {\mathbb E}_{y} \left( \sum _{k=0}^{\tau -1} f(X_k) + g(X_\tau )\right) \\\\&\displaystyle = f(x) + \sum _{y} p_{xy} v(y) = f(x) + {\mathbb E}_x v(X_1). \end{aligned}$$

From this it clearly follows that v solves the equation,

$$ Lv(x) = {\mathbb E}_x v(X_1) - v(x) = -f(x). $$

3. Uniqueness follows from the maximum principle. Indeed, let \(v^1\) and \(v^2\) be two solutions. Then

$$ u(x):= v^1(x) - v^2(x) = 0, \quad \forall \;\; x\in \Gamma . $$

Also, at any \(x\not \in \Gamma \),

$$ Lu(x) = Lv^1(x) - Lv^2(x) = 0. $$

Hence, by virtue of the Lemma 31, both maximal and minimal values of the function u are attained at the boundary \(\Gamma \). However, at the boundary both these values are equal to zero. Therefore,

$$ u(x)= 0, \quad \forall \; x\in S, $$

that is, \(v^1-v^2 \equiv 0\), as required. This completes the proof of the Theorem 32.
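The probabilistic representation (37) can be tested against a direct linear solve of (35). The sketch below (assuming Python with numpy; the three-state chain, f, g, and the sample size are our own illustrative choices) computes u on the interior by solving \(u = f + \mathcal{P}u\) with \(u = g\) on \(\Gamma \), and compares it with a Monte Carlo evaluation of \(v(x)\) from (37).

```python
import numpy as np

# Our illustrative example: S = {0, 1, 2}, boundary Gamma = {2}.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
f = np.array([1.0, -2.0, 0.0])       # right-hand side on the interior
g2 = 3.0                             # boundary value g on Gamma = {2}

# Linear solve of u = f + P u on the interior {0, 1}, with u(2) = g2:
A = np.eye(2) - P[:2, :2]
u = np.linalg.solve(A, f[:2] + P[:2, 2] * g2)

# Monte Carlo evaluation of (37), started from x = 0:
rng = np.random.default_rng(2)
m, total = 20000, 0.0
for _ in range(m):
    x, acc = 0, 0.0
    while x != 2:                     # tau = first hitting time of Gamma
        acc += f[x]
        x = rng.choice(3, p=P[x])
    total += acc + g2
print(u[0], total / m)                # the two values should be close
```

Since all entries of \(\mathcal P\) are positive, \(\tau \) has geometric tails and the episodes terminate quickly, so the Monte Carlo average converges fast.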

9.3 Poisson Equation (33) without a Boundary

Consider the equation on the whole S,

$$\begin{aligned} L u (x) = -f(x), \;\; x\in S. \end{aligned}$$
(39)

We will need an additional assumption on f called ‘centering’. This condition is a close analogue of the subtraction in the standardization for a CLT.

Assumption 33

(Centering) It is assumed that the function f satisfies the condition,

$$\begin{aligned} {\mathbb E}_{inv}f(X_0) \equiv \sum _x f(x)\mu (x) = 0, \end{aligned}$$
(40)

where \(\mu \) is the (unique) invariant measure of the process X.

Theorem 34

Under the assumption (40), the Eq. (39) has a solution u, which is unique up to an additive constant. This solution is given by the formula

$$\begin{aligned} u(x) = \sum _{k=0}^{\infty } {\mathbb E}_{x} f(X_k). \end{aligned}$$
(41)

The solution u from (41) itself satisfies the centering condition,

$$\begin{aligned} \sum u(x) \mu (x) = 0. \end{aligned}$$
(42)

Note that the ‘educated guess’ about a solution represented by the formula (41) may be deduced from the comparison with (37) where, so to say, we want to drop the terminal summand g as there is no boundary and to replace \(\tau \) by infinity; naturally, expectation and summation should be interchanged. Also, in the present setting the idea based on considering the series for \((I- \mathcal{P})^{-1}\) on centred functions may be applied. Yet, we would like to avoid this way because in a more general ‘non-compact’ situation a polynomial convergence of the series in (41) would also suffice, and, hence, this approach looks more general.

Proof of Theorem 34. 1. Convergence. This follows straightforwardly from the Corollary 9 and shows that the function u(x) defined in (41) is everywhere finite.

2. Satisfying the equation. From the Markov property,

$$\begin{aligned} u(x) = f(x) + \sum _{y} {\mathbb E}_x 1(X_1=y) {\mathbb E}_{y} \sum _{k=0}^{\infty } f(X_k) \\\\ = f(x) + \sum _{y} p_{xy} u(y) = f(x) + {\mathbb E}_x u(X_1). \end{aligned}$$

From this, the statement clearly follows,

$$ Lu(x) = {\mathbb E}_x u(X_1) - u(x) = -f(x). $$

3. Uniqueness. Let \(u^1\) and \(u^2\) be two solutions both satisfying the moderate growth and centering. Denote \(v = u^1 - u^2\). Then

$$ Lv = 0. $$

By virtue of Dynkin’s formula (31),

$$ {\mathbb E}v(X_n) - v(x) = 0. $$

However, due to Corollary 9,

$$ {\mathbb E}_x v(X_n) \rightarrow {\mathbb E}_{inv} v(X_0) = 0. $$

Hence,

$$ v(x) \equiv 0, $$

as required.

4. Centering. We have, due to the good convergence—see Corollary 9—and Fubini’s theorem, since the measure \(\mu \) is stationary, and finally because f is centered,

$$\begin{aligned}&\displaystyle \sum _x u(x)\mu (x) = \sum _x \mu (x)\sum _{k=0}^{\infty } {\mathbb E}_{x} f(X_k) \\\\&\displaystyle = \sum _{k=0}^{\infty }\sum _x \mu (x) {\mathbb E}_{x} f(X_k) = \sum _{k=0}^{\infty } {\mathbb E}_{inv} f(X_k) = 0. \end{aligned}$$

Theorem 34 is proved.
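On a finite state space, Theorem 34 is easy to check numerically: truncate the series (41) and verify both \(Lu=-f\) and the centering (42). A minimal sketch in Python, assuming a hypothetical two-state chain (the matrix P, the centered function f and the truncation length are assumptions made for illustration only):

```python
# A hypothetical two-state chain with positive transition probabilities.
P = [[0.7, 0.3],
     [0.4, 0.6]]
mu = [4/7, 3/7]                       # invariant measure: mu P = mu

f = [3.0, -4.0]                       # centered: sum_x f(x) mu(x) = 0, cf. (40)

def apply_P(h):
    # (P h)(x) = E_x h(X_1)
    return [sum(P[x][y] * h[y] for y in range(2)) for x in range(2)]

# Partial sums of the series (41): u(x) = sum_k E_x f(X_k) = sum_k (P^k f)(x).
u, g = [0.0, 0.0], f[:]
for _ in range(200):                  # the terms decay geometrically
    u = [u[x] + g[x] for x in range(2)]
    g = apply_P(g)

Pu = apply_P(u)
residual = [Pu[x] - u[x] + f[x] for x in range(2)]   # L u + f, should vanish
centering = sum(u[x] * mu[x] for x in range(2))      # condition (42)
```

The residual of the equation and the centering sum are both zero up to rounding, exactly as Theorem 34 asserts.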

10 Poisson Equation with a Potential

Let us remind the reader that the case \(|S|<\infty \) is under consideration.

10.1 Equation (34)

Recall Eq. (34):

$$ \exp (-c(x))\mathcal{P}u(x) - u(x)= -f(x). $$

A natural candidate for the solution is the function

$$\begin{aligned} u(x) := \sum _{n=0}^\infty {\mathbb E}_x \exp \left( -\sum _{k=0}^{n-1} c(X_k) \right) f(X_n), \end{aligned}$$
(43)

provided that this expression is well-defined. Naturally, on our finite state space S both f and c are bounded. Denote

$$ \varphi _n := \sum \limits _{k=0}^{n} c(X_k), \quad \varphi _{-1}=0, $$

and

$$ L^c := \exp (-c(x))\mathcal{P} - I, $$

that is,

$$ L^c u(x) := \exp (-c(x))\mathcal{P}u(x) - u(x). $$

We can tackle several cases, and the most interesting one in our view is where \(c(x) = \varepsilon c_1(x)\), \(\varepsilon > 0\) small and \(\displaystyle \bar{c}_1:= \sum \limits _{x} c_1(x) \mu (x) > 0\). Denote also \(\bar{c} = \sum \limits _x c(x) \mu (x)\).

10.2 Further Dynkin’s Formulae

Lemma 35

(Dynkin’s formula 4)

$$\begin{aligned} \mathbb E_x \exp (-\varphi _{n-1})\, h(X_n) = h(x) + \sum _{k=0}^{n-1} \mathbb E_x \exp (-\varphi _{k-1}) L^{c} h(X_k). \end{aligned}$$
(44)

In other words, the process

$$\begin{aligned} M_n:= \exp (-\varphi _{n-1})\, h(X_n) - h(x) - \sum _{k=0}^{n-1} \exp (-\varphi _{k-1})L^{c} h(X_k), \quad n\ge 0, \end{aligned}$$
(45)

is a martingale.

Proof

Let the initial state x be fixed. Let us check the base, \(\underline{n=0}\). Note that \(\varphi _0 = c(x)\), \(\varphi _{-1} = 0\), and \(L^c h(x)= \exp (-\varphi _0)\mathcal{P}h(x) - h(x)\). So, for \(n=0\) the formula (44) reads,

$$ h(x) = h(x) + \sum _{k=0}^{-1} \mathbb E_x L^c h(X_k), $$

which is true due to the standard convention that \(\sum \limits _{k=0}^{-1} \cdots = 0\).

Let us check the first step, \(\underline{n=1}\):

$$\begin{aligned} \mathbb E_x \exp (-c(x))\, h(X_1)= & {} h(x) + \sum _{k=0}^{0} \mathbb E_x \exp (-\varphi _{-1})L^{c} h(X_k) \equiv h(x) \\&+ \exp (-c(x))\mathcal{P} h(x) - h(x), \end{aligned}$$

or, equivalently,

$$\begin{aligned} \mathbb E_x \exp (-c(x))\, h(X_1) = \exp (-c(x))\mathcal{P} h(x), \end{aligned}$$

which is also true.

The induction step with a general \(\underline{n\ge 1}\) follows similarly, using the Markov property. Indeed, assume that the formula (44) is true for some \(n \ge 0\). Then, for \(n+1\) we have,

$$\begin{aligned}&\displaystyle \mathbb E_x \exp (-\varphi _{n})\, h(X_{n+1}) - h(x) - \sum _{k=0}^{n} \mathbb E_x \exp (-\varphi _{k-1}) L^{c} h(X_k) \\\\&\displaystyle = \mathbb E_x \exp (-\varphi _{n})\, h(X_{n+1}) - \mathbb E_x \exp (-\varphi _{n-1})\, h(X_{n}) + \mathbb E_x \exp (-\varphi _{n-1})\, h(X_{n}) \\\\&\displaystyle - h(x) - \sum _{k=0}^{n-1} \mathbb E_x \exp (-\varphi _{k-1}) L^{c} h(X_k) - \mathbb E_x \exp (-\varphi _{n-1}) L^{c} h(X_n) \\\\&\displaystyle = \mathbb E_x \exp (-\varphi _{n})\, h(X_{n+1}) - \mathbb E_x \exp (-\varphi _{n-1})\, h(X_{n}) - \mathbb E_x \exp (-\varphi _{n-1}) L^{c} h(X_n) \\\\&\displaystyle = \mathbb E_x \left[ \mathbb E_x \left( \exp (-\varphi _{n})\, h(X_{n+1}) - \exp (-\varphi _{n-1})\, h(X_{n}) - \exp (-\varphi _{n-1}) L^{c} h(X_n) | \mathcal{F}_n\right) \right] \\\\&\displaystyle = \mathbb E_x \exp (-\varphi _{n-1}) \left[ \mathbb E_x \left( \exp (-c(X_{n}))\, h(X_{n+1}) - h(X_{n}) - L^{c} h(X_n) | X_n\right) \right] =0, \end{aligned}$$

by definition of \(L^c\). This completes the induction step, so Lemma 35 is proved.
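Since S is finite, the identity (44) can be checked exactly by enumerating all trajectories of a fixed length. A minimal sketch in Python, assuming a hypothetical two-state chain, potential c and test function h:

```python
import itertools
import math

P = [[0.7, 0.3], [0.4, 0.6]]   # hypothetical two-state chain
c = [0.2, 0.5]                 # potential
h = [1.0, -2.0]                # arbitrary test function

# L^c h(x) = exp(-c(x)) P h(x) - h(x)
Lch = [math.exp(-c[x]) * sum(P[x][y] * h[y] for y in (0, 1)) - h[x]
       for x in (0, 1)]

def phi(path, m):
    # phi_m = sum_{k=0}^{m} c(X_k), with the convention phi_{-1} = 0
    return sum(c[path[k]] for k in range(m + 1))

def expectation(x0, n, functional):
    # exact expectation over all trajectories (X_0, ..., X_n) with X_0 = x0
    total = 0.0
    for tail in itertools.product((0, 1), repeat=n):
        path = (x0,) + tail
        prob = 1.0
        for k in range(n):
            prob *= P[path[k]][path[k + 1]]
        total += prob * functional(path)
    return total

n, x0 = 4, 0
# left hand side of (44): E_x exp(-phi_{n-1}) h(X_n)
lhs = expectation(x0, n, lambda p: math.exp(-phi(p, n - 1)) * h[p[n]])
# right hand side of (44)
rhs = h[x0] + sum(
    expectation(x0, n, lambda p, k=k: math.exp(-phi(p, k - 1)) * Lch[p[k]])
    for k in range(n))
```

Both sides agree to machine precision for every horizon n, which is precisely the martingale property (45).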

Lemma 36

(Dynkin’s formula 5) Let \(\tau \) be a stopping time with

$$ \mathbb E_x e^{\alpha \tau } < \infty , \quad \forall \; x\in S, $$

for some \(\alpha >0\). Then, for any function h on S, if \(c=\epsilon c_1\) with \(\epsilon \) small enough,

$$ \mathbb E_x e^{-\varphi _{\tau -1}} h(X_\tau ) = h(x) + \mathbb E_x\sum _{k=0}^{\tau -1} e^{-\varphi _{k-1}} L^c h (X_k). $$

Recall that for an irreducible Markov chain with values in a finite state space any hitting time has some finite exponential moment. This will be used in the next subsection.

Proof

We conclude from (44), or (45), due to Doob’s theorem about stopped martingales,

$$ \mathbb E_x e^{-\varphi _{(\tau -1)\wedge n}} h(X_{\tau \wedge n}) = h(x) + \mathbb E_x\sum _{k=0}^{(\tau -1)\wedge n} e^{-\varphi _{k-1}} L^c h (X_k). $$

Now, if \(\epsilon \) is small enough, then we may pass to the limit as \(n\rightarrow \infty \), due to Lebesgue’s theorem on taking limits under uniform integrability. We have,

$$ \mathbb E_x e^{-\varphi _{(\tau -1)\wedge n}} h(X_{\tau \wedge n}) \rightarrow \mathbb E_x e^{-\varphi _{\tau -1}} h(X_{\tau }), $$

and

$$ h(x) + \mathbb E_x\sum _{k=0}^{(\tau -1)\wedge n} e^{-\varphi _{k-1}} L^c h (X_k) \rightarrow h(x) + \mathbb E_x\sum _{k=0}^{\tau -1} e^{-\varphi _{k-1}} L^c h (X_k), \quad n\rightarrow \infty , $$

as required. Lemma 36 is proved.

10.3 Poisson Equation with a Potential with a Boundary

Recall the equation with the boundary:

$$\begin{aligned} \exp (-c(x))\mathcal{P}u(x) - u(x)= -f(x), \; x\in S\setminus \Gamma , \quad u(x) = g(x), \; x\in \Gamma , \end{aligned}$$
(46)

with a boundary \(\Gamma \not = \emptyset \). The natural candidate for the solution is the function

$$\begin{aligned} u(x) := {\mathbb E}_x \left( \sum _{n=0}^{\tau -1} \exp \left( -\varphi _{n-1} \right) f(X_n) + \exp \left( -\varphi _{\tau -1} \right) g(X_\tau )\right) , \end{aligned}$$
(47)

\(\tau = \inf (n\ge 0:\, X_n\in \Gamma )\). If \(x\in \Gamma \), then \(\tau = 0\), and we agree that the sum \(\sum \limits _{n=0}^{-1}\) equals zero.

Theorem 37

If the expectation in (47) is finite, then the function u(x) is the unique solution of the equation (46).

Recall that \(\tau \) does have some finite exponential moment, so if \(c=\epsilon c_1\) as in the statement of Lemma 36, and if \(\epsilon \) is small enough, then the expression in (47), indeed, converges.

The proof of Theorem 37 is similar to that of Theorem 32. Firstly, if \(x\in \Gamma \), then clearly \(\tau = 0\), so that \(u(x) = g(x)\). Secondly, if \(x\not \in \Gamma \), then clearly \(\tau \ge 1\). Then, due to the Markov property and by splitting the sum, i.e. taking the sum \(\sum \limits _{n=1}^{\tau -1}\) and separately considering the term corresponding to \(n=0\), which is just f(x), we have,

$$\begin{aligned}&\displaystyle u(x) = f(x) + {\mathbb E}_x \left( \sum _{n=1}^{\tau -1} \exp \left( -\sum _{k=0}^{n-1} c(X_k)\right) f(X_n) + \exp \left( -\sum _{k=0}^{\tau -1} c(X_k)\right) g(X_\tau )\right) \\\\&\displaystyle = f(x) + {\mathbb E}_x {\mathbb E}_x \left[ \sum _{n=1}^{\tau -1} \exp \left( -\sum _{k=0}^{n-1} c(X_k)\right) f(X_n) + \exp \left( -\sum _{k=0}^{\tau -1} c(X_k)\right) g(X_\tau ) | X_1\right] \\\\&\displaystyle = f(x) + {\mathbb E}_x \exp (-c(x)){\mathbb E}_{X_1} \left( \sum _{n=0}^{\tau -1} \exp \left( -\sum _{k=0}^{n-1} c(X_k)\right) f(X_n) + \exp \left( -\sum _{k=0}^{\tau -1} c(X_k)\right) g(X_\tau )\right) \\\\&\displaystyle = f(x) + \exp (-c(x)) {\mathbb E}_x u(X_1) = f(x) + \exp (-c(x)) \mathcal{P} u (x), \end{aligned}$$

which is exactly Eq. (46), as required. Theorem 37 is proved.
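On a finite chain the interior system (46) is linear and can be solved exactly, after which the representation (47) can be tested by simulation. A minimal sketch in Python (the chain, the potential c, the data f and g, and the sample size are all illustrative assumptions):

```python
import math
import random

random.seed(1)

P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
c = [0.1, 0.2, 0.3]
f = [1.0, -1.0, 0.0]
g2 = 2.0                      # boundary value on Gamma = {2}

# Solve (46) directly for the interior states {0, 1}:
#   exp(-c(x)) * sum_y p_xy u(y) - u(x) = -f(x),   u(2) = g2.
a = [[math.exp(-c[x]) * P[x][y] - (1.0 if x == y else 0.0) for y in (0, 1)]
     for x in (0, 1)]
b = [-f[x] - math.exp(-c[x]) * P[x][2] * g2 for x in (0, 1)]
det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
u0 = (b[0] * a[1][1] - b[1] * a[0][1]) / det       # u(0) by Cramer's rule

# Monte Carlo estimate of the probabilistic representation (47) from x = 0.
N, total = 100_000, 0.0
for _ in range(N):
    x, phi, val = 0, 0.0, 0.0
    while x != 2:
        val += math.exp(-phi) * f[x]    # exp(-phi_{n-1}) f(X_n), phi_{-1} = 0
        phi += c[x]
        r, acc = random.random(), 0.0
        for y in (0, 1, 2):
            acc += P[x][y]
            if r < acc:
                x = y
                break
    val += math.exp(-phi) * g2          # exp(-phi_{tau-1}) g(X_tau)
    total += val
```

The Monte Carlo average of the discounted functional along trajectories stopped at \(\Gamma \) agrees with the linear-algebra solution within the statistical error.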

10.4 Poisson Equation with a Potential Without a Boundary

Recall Eq. (34):

$$ \exp (-c(x))\mathcal{P}u(x) - u(x)= -f(x), $$

and the natural candidate for the solution ‘in the whole space’ is the function

$$\begin{aligned} u(x) := \sum _{n=0}^\infty \mathbb E_x \exp \left( -\sum _{k=0}^{n-1} c(X_k) \right) f(X_n). \end{aligned}$$
(48)

The main question here is the question of convergence. As was mentioned earlier, we are interested in the following case: \(c(x) = \epsilon c_1(x)\), \(\epsilon >0\) small, and \(\sum c_1(x) \mu (x) > 0\), where \(\mu \) is the unique invariant measure of the Markov chain X.

10.5 Convergence

The first goal is to justify that \(u\) is well-defined. Recall that we want to show convergence of the series,

$$ u(x) = \sum _{n=0}^\infty {\mathbb E}_x \exp \left( -\sum _{k=0}^{n-1} c(X_k)\right) f(X_n), $$

with \( c(x) = \epsilon c_1(x)\), \(\bar{c}_1=\sum c_1(x) \mu (x) > 0,\) with \(\epsilon >0 \) small. Denote

$$ H_n(\beta , x) := n^{-1} \ln {\mathbb E}_x \exp \left( \beta \sum _{k=0}^{n-1} c_1(X_k)\right) , \quad \beta \in \mathbb R^1, $$

or, equivalently,

$$ {\mathbb E}_x \exp \left( \beta \sum _{k=0}^{n-1} c_1(X_k)\right) = \exp (n\,H_n(\beta ,x)). $$

(Note that this notation differs only slightly from how the function \(H_n\)—and in the next formula also H—was defined in Sect. 6: now it is constructed via the ‘additive functional’ related to another function \(c_1\). Yet, the meaning is similar, so there is no need to change this standard notation.) Let

$$ H(\beta ) := \lim _{n\rightarrow \infty } H_n(\beta ,x), \quad \beta \in \mathbb R^1. $$

As we have seen in Sect. 6, this limit exists for all values of \(\beta \). (The fact that in Sect. 6 this was shown for another function and under the centering condition for that function is of no importance, because the average may always be subtracted.)

Also, it may be proved—left as an exercise to the reader (here a lemma from [25] on estimating the spectral radius may be useful)—that for any \(B>0\) and \(\delta >0\) there exists \(n(\delta )\) such that, uniformly in x,

$$\begin{aligned} \sup _{|\beta | \le B} |H(\beta ) - H_n(\beta , x)| \le \delta , \quad n\ge n(\delta ). \end{aligned}$$
(49)

Unlike in Sect. 6, where it was assumed that \(\bar{f} = 0\), here we compute,

$$ H'_n(0,x) = n^{-1} {\mathbb E}_x \sum _{k=0}^{n-1} c_1(X_k), $$

where, as usual, the notation \(H'_n(0,x)\) stands for \(\partial H_n(\beta ,x)/\partial \beta |_{\beta = 0}\). Now, due to Corollary 9,

$$ \lim _{n\rightarrow \infty } n^{-1} {\mathbb E}_x \sum _{k=0}^{n-1} c_1(X_k) = \bar{c}_1 = \langle c_1 , \mu \rangle >0. $$

This means that in our case

$$ H'(0) = \bar{c}_1 > 0, $$

and that, at least, in some neighbourhood of zero,

$$ \begin{aligned} H(\beta )>0, \quad \beta >0, \qquad \& \qquad H(\beta )<0, \quad \beta <0. \end{aligned}$$
(50)

Now, convergence of the sum defining u, for each x, for \(\epsilon >0\) small enough and uniformly in x (recall that \(|S|<\infty \)), follows from (50). Indeed, choose \(\epsilon >0\) so that \(H(-\epsilon ) < 0\) and, for the fixed \(\delta = - H(-\epsilon )/2\), choose \(n_0\) such that \(|H_n(-\epsilon , x) - H(-\epsilon )| < \delta \) for all \(n\ge n_0\) and any x. We estimate, for such \(\epsilon \) and any x (with \(\epsilon \) independent of x), the finitely many terms with \(n < n_0\) being bounded anyway,

$$\begin{aligned} |u(x)| \le \Vert f\Vert _B \, \sum _{n=0}^\infty {\mathbb E}_x \exp \left( -\epsilon \sum _{k=0}^{n-1} c_1(X_k)\right) = \Vert f\Vert _B \, \sum _{n=0}^\infty \exp (n H_n(-\epsilon ,x)) \\\\ \le \Vert f\Vert _B \, \sum _{n=0}^\infty \exp (n (H(-\epsilon ) + \delta )) \le \Vert f\Vert _B \, \exp (\delta ) \sum _{n=0}^\infty \exp (n H(-\epsilon )) < \infty . \end{aligned}$$
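For a concrete finite chain, \(H(\beta )\) can be computed as the logarithm of the Perron eigenvalue of the positive matrix \(M_\beta (x,y) = e^{\beta c_1(x)} p_{xy}\), a standard characterization for such multiplicative functionals. A minimal sketch in Python (the two-state chain and the function \(c_1\) are assumptions for illustration) checking that \(H(0)=0\), \(H'(0)=\bar{c}_1>0\) and \(H(-\epsilon )<0\) for small \(\epsilon >0\), as in (50):

```python
import math

P = [[0.7, 0.3], [0.4, 0.6]]   # hypothetical two-state chain
mu = [4/7, 3/7]                # its invariant measure
c1 = [1.0, -0.5]
cbar1 = sum(c1[x] * mu[x] for x in (0, 1))   # = 5/14 > 0

def H(beta, iters=500):
    # log of the Perron eigenvalue of M_beta(x,y) = exp(beta c1(x)) p_xy,
    # found by power iteration (the matrix is strictly positive).
    v, lam = [1.0, 1.0], 1.0
    for _ in range(iters):
        w = [math.exp(beta * c1[x]) * sum(P[x][y] * v[y] for y in (0, 1))
             for x in (0, 1)]
        lam = max(w)
        v = [wi / lam for wi in w]
    return math.log(lam)
```

A central difference of H at zero recovers \(\bar{c}_1\), and H changes sign at the origin as claimed in (50).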

10.6 u Solves the Equation

Let us argue why the function u solves the Poisson equation (34). By the Markov property,

$$\begin{aligned} u(x) = f(x) + \exp (-c(x))\sum _{y} {\mathbb E}_x 1(X_1=y) {\mathbb E}_{y} \sum _{k=0}^{\infty } \exp (-\varphi _{k-1}) f(X_k) \\\\ = f(x) + \exp (-c(x))\sum _{y} p_{xy} u(y) = f(x) + \exp (-c(x)){\mathbb E}_x u(X_1). \end{aligned}$$

From this, it follows clearly that, as required,

$$ L^cu(x) = \exp (-c(x)){\mathbb E}_x u(X_1) - u(x) = -f(x). $$

10.7 Uniqueness of Solution

Uniqueness may be shown in a standard manner. For the difference of two solutions \(v = u^1 - u^2\) we have \(L^c v = 0\). Therefore, we get,

$$ v(x) = \exp (-c(x)) {\mathbb E}_x v(X_1). $$

After iterating this formula by induction n times, we obtain,

$$ v(x) = {\mathbb E}_x \exp \left( - \sum _{k=0}^{n-1}c(X_k)\right) v(X_n). $$

Recall that the function v is necessarily bounded on a finite state space S. Hence, it follows that \(v(x) \equiv 0\). Indeed, we estimate,

$$ |v(x)| = |{\mathbb E}_x \exp \left( -\sum _{k=0}^{n-1} c(X_k)\right) v(X_n)| \le C {\mathbb E}_x \exp \left( -\sum _{k=0}^{n-1} c(X_k)\right) . $$

Hence, we get, for any \(n\ge 0\),

$$\begin{aligned} |v(x)| \le C {\mathbb E}_x \exp \left( -\epsilon \sum _{k=0}^{n-1} c_1(X_k)\right) = C \exp (n H_n(-\epsilon ,x)). \end{aligned}$$
(51)

Recall that \(H_n(\beta , x) \rightarrow H(\beta )\), \(n\rightarrow \infty \), and that \(H(-\epsilon ) < 0\) for \(\epsilon >0\) small enough. So, the right hand side in (51) converges to zero exponentially fast as \(n\rightarrow \infty \). Since the left hand side does not depend on n, we get \(|v(x)|=0\), i.e. \(u^1 \equiv u^2\), as required.

11 Ergodic Theorem, General Case

Now let us consider a more general construction on a more general state space. It is assumed that

$$\begin{aligned} \kappa := \inf _{x,x'} \int \left( \frac{P_{x'}(1,dy)}{P_x(1,dy)}\wedge 1 \right) P_x(1,dy) > 0. \end{aligned}$$
(52)

Note that here \(\displaystyle \frac{P_{x'}(1,dy)}{P_x(1,dy)}\) is understood in the sense of the density of the absolutely continuous component. For brevity we will use the simplified notation \(P_x(dz)\) for \(P_x(1,dz)\). Another, slightly less general, condition will be adopted in the next section, but it is convenient to introduce it here: suppose that there exists a measure \(\Lambda \) with respect to which each measure \(P_x(1,dz)\) is absolutely continuous,

$$\begin{aligned} P_x(1,dz)<\!\!\!< \Lambda (dz), \qquad \forall \, x \in S. \end{aligned}$$
(53)

Under the assumption (53) we have another representation of the constant \(\kappa \) from (52).

Lemma 38

Under the assumption (53), we have the following representation for the constant from (52),

$$\begin{aligned} \kappa = \inf _{x,x'} \int \left( \frac{P_{x'}(1,dy)}{\Lambda (dy)}\wedge \frac{P_{x}(1,dy)}{\Lambda (dy)} \right) \Lambda (dy). \end{aligned}$$
(54)

Proof

Firstly, note that clearly the right hand side in (54) does not depend on the particular measure \(\Lambda \), i.e. for any other measure with respect to which both \(P_{x'}(1,dy)\) and \(P_{x}(1,dy)\) are absolutely continuous, the formula (54) gives the same result. Indeed, this follows directly from the fact that if, say, \(d\Lambda<\!\!\!< d\tilde{\Lambda }\) and \(d\Lambda = f d\tilde{\Lambda }\), then we get,

$$\begin{aligned}&\int \left( \frac{P_{x'}(1,dy)}{\Lambda (dy)}\wedge \frac{P_{x}(1,dy)}{\Lambda (dy)} \right) \Lambda (dy)\\&\qquad \qquad = \int \left( \frac{P_{x'}(1,dy)}{f\tilde{\Lambda }(dy)}\wedge \frac{P_{x}(1,dy)}{f\tilde{\Lambda }(dy)} \right) f(y) 1(f(y)>0)\tilde{\Lambda }(dy) \\&\qquad \qquad = \int \left( \frac{P_{x'}(1,dy)}{\tilde{\Lambda }(dy)}\wedge \frac{P_{x}(1,dy)}{\tilde{\Lambda }(dy)} \right) 1(f(y)>0)\tilde{\Lambda }(dy). \end{aligned}$$

However, \(P_{x'}(1,dy)<\!\!\!< \Lambda (dy) = f(y) \tilde{\Lambda }(dy)\), so for any measurable A we have \(\int _A P_{x'}(1,dy) 1(f(y)=0) = 0\) and the same for \(P_x(1,dy)\), which means that, actually,

$$\begin{aligned}&\int \left( \frac{P_{x'}(1,dy)}{\tilde{\Lambda }(dy)}\wedge \frac{P_{x}(1,dy)}{\tilde{\Lambda }(dy)} \right) 1(f(y)>0)\tilde{\Lambda }(dy) \\&\qquad \qquad = \int \left( \frac{P_{x'}(1,dy)}{\tilde{\Lambda }(dy)}\wedge \frac{P_{x}(1,dy)}{\tilde{\Lambda }(dy)} \right) \tilde{\Lambda }(dy). \end{aligned}$$

Respectively, if there are two reference measures \(\Lambda \) and, say, \(\Lambda '\), then we may take \(\tilde{\Lambda }= \Lambda + \Lambda '\), and the coefficients computed by using each of the two—\(\Lambda \) and \(\Lambda '\)—will be represented via \(\tilde{\Lambda }\) in the same way.

Secondly, let \(\displaystyle f_x(y) = \frac{P_x(1,dy)}{\Lambda (dy)}(y)\). Then,

$$\begin{aligned}&\kappa = \inf _{x,x'} \int \left( \frac{P_{x'}(1,dy)}{P_x(1,dy)}\wedge \frac{P_{x}(1,dy)}{P_x(1,dy)} \right) P_x(1,dy) \\&\qquad \qquad = \inf _{x,x'} \int \left( \frac{P_{x'}(1,dy)}{f_{x}(y)\Lambda (dy)}\wedge \frac{P_{x}(1,dy)}{f_x(y)\Lambda (dy)} \right) f_x(y)\Lambda (dy)\\&\qquad \qquad = \inf _{x,x'} \int \left( \frac{P_{x'}(1,dy)}{\Lambda (dy)}\wedge \frac{P_{x}(1,dy)}{\Lambda (dy)} \right) \Lambda (dy), \end{aligned}$$

as required. Lemma 38 is proved.

Denote

$$ \kappa (x,x') : = \int \left( \frac{P_{x'}(1,dy)}{P_x(1,dy)}\wedge 1 \right) P_x(1,dy). $$

Clearly, for any \(x,x'\in S\),

$$\begin{aligned} \kappa (x,x') \ge \kappa . \end{aligned}$$
(55)

Lemma 39

For any \(x,x'\in S\),

$$ \kappa (x,x') = \kappa (x',x). $$

Proof

Under the more restrictive assumption (53) we have,

$$ \kappa (x,x') = \int \left( \frac{P_{x'}(1,dy)}{P_x(1,dy)}\wedge 1 \right) P_x(dy) = \int \left( \frac{P_{x'}(1,dy)}{\Lambda (dy)}\wedge \frac{P_{x}(1,dy)}{\Lambda (dy)} \right) \Lambda (dy), $$

which expression is, apparently, symmetric with respect to x and \(x'\), as required. Without assuming (53) we can argue as follows. Denote \(\Lambda _{x,x'}(dz) = P_x(1,dz)+ P_{x'}(1,dz)\). Note that by definition, \(\Lambda _{x,x'} = \Lambda _{x',x}\). Then we have,

$$\begin{aligned} \kappa (x,x')= & {} \int \left( \frac{P_{x'}(1,dy)}{P_x(1,dy)}\wedge 1 \right) P_x(1,dy) \nonumber \\= & {} \int \left( \frac{P_{x'}(1,dy)}{P_x(1,dy)}\wedge 1 \right) \frac{P_x(1,dy)}{\Lambda _{x,x'}(dy)} \Lambda _{x,x'}(dy) \nonumber \\= & {} \int \left( \frac{P_{x'}(1,dy)}{\Lambda _{x,x'}(dy)} \wedge \frac{P_x(1,dy)}{\Lambda _{x,x'}(dy)} \right) \Lambda _{x,x'}(dy). \end{aligned}$$
(56)

The latter expression is symmetric with respect to x and \(x'\), which proves the Lemma 39.

Definition 40

If an MC \((X_n)\) satisfies the condition (52)—we call it the MD-condition in the sequel—then we call this process a Markov–Dobrushin process, or MD-process.

In the easier situation of finite chains this condition was introduced by Markov himself [30]; later on, for non-homogeneous Markov processes its analogue was suggested and used by Dobrushin [8]. So, we call it the Markov–Dobrushin condition, as already suggested earlier by Seneta. Note that in all cases \(\kappa \le 1\). The case \(\kappa = 1\) corresponds to an i.i.d. sequence \((X_n)\). In the opposite extreme situation, where the transition kernels are singular for different x and \(x'\), we have \(\kappa = 0\). The MD-condition (52)—as well as (54)—is most useful because it provides an effective quantitative upper bound for the convergence rate of a Markov chain towards its (unique) invariant measure in the total variation metric.

Theorem 41

Let the assumption (52) hold true. Then the process \((X_n)\) is ergodic, i.e. there exists a limiting probability measure \(\mu \), which is stationary and such that (1) holds true. Moreover, the uniform bound is satisfied for every n,

$$\begin{aligned} \sup _{x}\sup _{A\in \mathcal{S}} |P_x(n,A) - \mu (A)| \le (1-\kappa )^{n}. \end{aligned}$$
(57)

Recall that the total variation distance or metric between two probability measures may be defined as

$$ \Vert \mu -\nu \Vert _{TV} := 2 \sup _A (\mu (A) - \nu (A)). $$

Hence, the inequality (57) may be rewritten as

$$\begin{aligned} \sup _{x} \Vert P_x(n,\cdot ) - \mu (\cdot )\Vert _{TV} \le 2 (1-\kappa )^{n}. \end{aligned}$$
(58)

Proof

1. Denote for any measurable \(A \in \mathcal S\),

$$ M^{(n)}(A) := \sup _x P_x(n,A), \quad m^{(n)}(A) := \inf _x P_x(n,A). $$

Due to the Chapman–Kolmogorov equation we have,

$$\begin{aligned}&\displaystyle m^{(n+1)}(A) = \inf _x P_x(n+1, A) = \inf _x \int P_x(dz) P_z(n,A) \\\\&\displaystyle \ge \inf _x \int P_x(dz) m^{(n)}(A) = m^{(n)}(A). \end{aligned}$$

So, the sequence \((m^{(n)}(A))\) does not decrease. Similarly, \((M^{(n)}(A))\) does not increase. We are going to show the estimate

$$\begin{aligned} (0\le ) \;\; M^{(n)}(A)-m^{(n)}(A) \le (1-\kappa )^{n}. \end{aligned}$$
(59)

In particular, it follows that for any \(x, y \in S\) we have,

$$\begin{aligned} | P_x(n,A)- P_y(n,A)| \le (1-\kappa )^{n}. \end{aligned}$$
(60)

Moreover, by virtue of (59) and due to the monotonicity (\(M^{(n)}(A)\) decreases, while \(m^{(n)}(A)\) increases), both sequences \(M^{(n)}(A)\) and \(m^{(n)}(A)\) have limits, which coincide and are uniform in A:

$$\begin{aligned} \lim \limits _{n\rightarrow \infty } M^{(n)}(A) = \lim \limits _{n\rightarrow \infty } m^{(n)}(A) =: m(A), \end{aligned}$$
(61)

and

$$\begin{aligned} \sup \limits _A |M^{(n)}(A)-m(A)|\vee \sup \limits _A |m^{(n)}(A)-m(A)| \le (1-\kappa )^{n}. \end{aligned}$$
(62)

2. Let \(x,x' \in S\), and let \(\Lambda _{x,x'}\) be some reference measure for both \(P_x(1,dz)\) and \(P_{x'}(1,dz)\). Again by virtue of Chapman–Kolmogorov’s equation we have for any \(n>1\) (recall that we accept the notations, \(a_+ = a\vee 0 \equiv \max (a,0)\), and \(a_- = a\wedge 0 \equiv \min (a,0)\)),

$$\begin{aligned}&\displaystyle P_x(n,A) - P_{x'}(n,A) = \int [P_x(1,dz) - P_{x'}(1,dz)] P_z(n-1,A) \nonumber \\ \nonumber \\ \nonumber&\displaystyle = \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) \Lambda _{x,x'}(dz)\,P_z(n-1,A) \\ \nonumber \\ \nonumber&\displaystyle = \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) _+\, \Lambda _{x,x'}(dz)\,P_z(n-1,A) \\ \nonumber \\&\displaystyle + \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) _-\, \Lambda _{x,x'}(dz)\,P_z(n-1,A). \end{aligned}$$
(63)

Further, we have,

$$\begin{aligned}&\displaystyle \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) _+\, \Lambda _{x,x'}(dz)\,P_z(n-1,A) \\\\&\displaystyle \le \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) _+\, \Lambda _{x,x'}(dz)\,M^{(n-1)}(A), \end{aligned}$$

and similarly,

$$\begin{aligned}&\displaystyle \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) _- \, \Lambda _{x,x'}(dz)\,P_z(n-1,A) \\\\&\displaystyle \le \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) _-\, \Lambda _{x,x'}(dz)\,m^{(n-1)}(A), \end{aligned}$$

On the other hand,

$$\begin{aligned}&\displaystyle \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) _+ \Lambda _{x,x'}(dz) + \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) _- \Lambda _{x,x'}(dz) \\\\&\displaystyle = \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) \Lambda _{x,x'}(dz) = 1-1=0. \end{aligned}$$

Thus, we get,

$$\begin{aligned}&\displaystyle M^{(n)}(A)-m^{(n)}(A) = \sup _x P_x(n,A) - \inf _{x'}P_{x'}(n,A) \\\\&\displaystyle \le \sup _{x,x'} \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) _+\, \Lambda _{x,x'}(dz)\,(M^{(n-1)}(A)-m^{(n-1)}(A)). \end{aligned}$$

It remains to notice that (recall that \((a-b)_+ = a - a\wedge b \equiv a - \min (a,b)\))

$$\begin{aligned}&\displaystyle \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\right) _+\, \Lambda _{x,x'}(dz) \\\\&\displaystyle = \int \left( \frac{P_x(1,dz)}{\Lambda _{x,x'}(dz)} - \frac{P_{x'}(1,dz)}{\Lambda _{x,x'}(dz)}\wedge \frac{P_{x}(1,dz)}{\Lambda _{x,x'}(dz)}\right) \, \Lambda _{x,x'}(dz) = 1-\kappa (x,x')\le 1-\kappa . \end{aligned}$$

Now the bound (59) follows by induction.

3. Let us establish the existence of at least one stationary distribution. For any \(x\in S\) and any measurable A,

$$\begin{aligned} m^{(n)}(A) \le P_x(n,A) \le M^{(n)}(A). \end{aligned}$$
(64)

Due to (61) and (62), \((P_x(n,A))\) is a Cauchy sequence which converges exponentially fast and uniformly with respect to A. Denote

$$\begin{aligned} q(A):=\lim _{n\rightarrow \infty } P_x(n,A). \end{aligned}$$
(65)

Clearly, due to this uniform convergence, \(q(\cdot )\ge 0\), \(q(S)=1\), and the function q is additive in A. Moreover, by virtue of the same uniform convergence in A in (65), the function \(q(\cdot )\) is also ‘continuous at zero’, i.e. it is, actually, a sigma-additive measure. Furthermore, the uniform convergence implies that

$$\begin{aligned} \Vert P_x(n, \cdot ) - q(\cdot )\Vert _{TV} \rightarrow 0, \quad n\rightarrow \infty . \end{aligned}$$
(66)

4. Now, let us show stationarity. We have,

$$\begin{aligned}&\displaystyle q(A) = \lim _{n\rightarrow \infty } P^{}_{x_0}(n,A) = \lim _{n\rightarrow \infty } \int P^{}_{x_0}(n-1,dz)P_z(A) \\\\&\displaystyle = \int q(dz)P_z(A) + \lim _{n\rightarrow \infty } \int (P^{}_{x_0}(n-1,dz)-q(dz))P_z(A). \end{aligned}$$

Here, in fact, the second term equals zero. Indeed,

$$\begin{aligned}&\displaystyle |\int (P^{}_{x_0}(n-1,dz)-q(dz))P_z(A)| \le \int |P^{}_{x_0}(n-1,dz)-q(dz)|P_z(A) \\\\&\displaystyle \le \int |P^{}_{x_0}(n-1,dz)-q(dz)| = \Vert P^{}_{x_0}(n-1,\cdot )-q(\cdot )\Vert _{TV} \rightarrow 0, \quad n\rightarrow \infty . \end{aligned}$$

Thus, we find that

$$ q(A) = \int q(dz)P_z(A), $$

which is the definition of stationarity. This completes the proof of Theorem 41.
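On a finite state space the Markov–Dobrushin constant reduces to \(\kappa = \min _{x,x'} \sum _y \min (p_{xy}, p_{x'y})\), and the bound (57) can be checked directly. A minimal sketch in Python, assuming a hypothetical three-state chain:

```python
# A hypothetical three-state chain (an assumption for illustration).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
S = range(3)

# Markov-Dobrushin constant (52) on a finite state space.
kappa = min(sum(min(P[x][y], P[z][y]) for y in S) for x in S for z in S)

def step(Q):
    # one Chapman-Kolmogorov step: Q -> Q P
    return [[sum(Q[x][z] * P[z][y] for z in S) for y in S] for x in S]

# Approximate the invariant measure by iterating far enough
# (convergence is geometric by Theorem 41).
Q = [row[:] for row in P]
for _ in range(200):
    Q = step(Q)
mu = Q[0]

# Check the uniform bound (57): sup_x sup_A |P_x(n,A) - mu(A)| <= (1-kappa)^n.
Q, ok = [row[:] for row in P], True
for n in range(1, 30):
    sup_dist = max(sum(max(Q[x][y] - mu[y], 0.0) for y in S) for x in S)
    ok = ok and sup_dist <= (1 - kappa) ** n + 1e-12
    Q = step(Q)
```

Here \(\sup _A\) is computed as the sum of the positive parts, which for two probability measures equals half the total variation distance, in line with (58).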

Corollary 42

For any bounded Borel function f and any \(0\le s<t\),

$$ \sup _x \left| {\mathbb E}_x (f(X_t) | X_s) - {\mathbb E}_{inv} f(X_t)\right| \equiv \sup _x \left| {\mathbb E}_x (f(X_t) - {\mathbb E}_{inv} f(X_t)| X_s)\right| \le C_f (1-\kappa )^{t-s}, $$

or, equivalently,

$$ \sup _x |{\mathbb E}_x (f(X_t) | \mathcal{F}^X_s) - {\mathbb E}_{inv} f(X_t)| \le C_f (1-\kappa )^{t-s}. $$

12 Coupling Method: General Version

This more general version requires a change of probability space so as to construct a coupling. The results themselves in no way pretend to be new: we just suggest a presentation convenient for the author. In particular, all newly arising probability spaces at each step (i.e. at each time n) are explicitly shown. By ‘general’ we do not mean that it is the most general possible: this issue is not addressed here. It is just more general than that in Sect. 5; it is more involved because of the more complicated probability space, and it provides a better constant in the convergence bound. It turns out that the general version requires a bit of preparation; hence, we start with a section devoted to a couple of random variables, while the case of Markov chains will be considered separately in the next section.

The following folklore yet important lemma answers the following question: suppose we have two distributions, which are not singular, and the ‘common area’ equals some positive constant \(\kappa \). Is it possible to realize these two distributions on the same probability space so that the two corresponding random variables coincide exactly with probability \(\kappa \)? We call one version of this result ‘the lemma about three random variables’, and another one ‘the lemma about two random variables’.

Lemma 43

(‘Of three random variables’) Let \(\xi ^{1}\) and \(\xi ^2\) be two random variables defined, without loss of generality, on different probability spaces \((\Omega ^1, \mathcal{F}^1, \mathbb P^1)\) and \((\Omega ^2, \mathcal{F}^2, \mathbb P^2)\) (they become independent after we take the direct product), with densities \(p^1\) and \(p^2\) with respect to some reference measure \(\Lambda \), correspondingly. Then, if Markov–Dobrushin’s condition holds true,

$$\begin{aligned} \kappa := \int \left( p^1(x)\wedge p^2(x)\right) \Lambda (dx) > 0, \end{aligned}$$
(67)

then there exists one more probability space \((\Omega , \mathcal{F}, \mathbb P)\) and a random variable \(\xi ^3\) on it (and \(\xi ^2\) also lives on \((\Omega , \mathcal{F}, \mathbb P)\), clearly, with the same distribution) such that

$$ \begin{aligned} \mathcal{L}(\xi ^3) =\mathcal{L}(\xi ^1), \quad \& \quad \mathbb P(\xi ^3 = \xi ^2) = \kappa . \end{aligned}$$
(68)

Here \(\mathcal L\) denotes the distribution of the random variable under consideration. Note that in the case \(\kappa = 1\) we have \(p^1 = p^2\), so we can just set \(\xi ^3:= \xi ^2\), and then immediately both assertions of (68) hold. Note also that even if \(\kappa \) were equal to zero (excluded by the assumption (67)), i.e. the two distributions were singular, we could have set \(\xi ^3:= \xi ^1\), and again both claims in (68) would have been satisfied trivially. Hence, in the proof below it suffices to assume

$$ 0< \kappa < 1. $$

Proof of the Lemma 43. 1: Construction. Let

$$ A_1 := \{x: \; p^1(x) \ge p^2(x)\}, \quad A_2 := \{x: \; p^1(x) < p^2(x)\}. $$

We will need two new independent random variables, \(\zeta \sim U[0,1]\) (uniformly distributed random variable on [0, 1]) and \(\eta \) with the density

$$ \displaystyle p^\eta (x) := \frac{p^1 - p^1\wedge p^2}{\displaystyle \int (p^1 - p^1\wedge p^2)(y)\Lambda (dy)}(x) \equiv \frac{p^1 - p^1\wedge p^2}{\displaystyle \int \limits _{A_1} (p^1 - p^1\wedge p^2)(y)\Lambda (dy)}(x). $$

Both \(\zeta \) and \(\eta \) are assumed to be defined on their own probability spaces. Now let (on the direct product of all these probability spaces, i.e. of the probability spaces where the random variables \(\xi ^1, \xi ^2, \zeta , \eta \) are defined)

$$ \xi ^3:= \xi ^2 1(\frac{p^1}{p^1\vee p^2}(\xi ^2)\ge \zeta ) + \eta 1(\frac{p^1}{p^1\vee p^2}(\xi ^2) < \zeta ). $$

We shall see that \(\xi ^3\) admits all the desired properties. Denote

$$ C := \{\omega : \; \frac{p^1}{p^1\vee p^2}(\xi ^2)\ge \zeta \}. $$

Then \(\xi ^3\) may be rewritten as

$$\begin{aligned} \xi ^3 = \xi ^2 1(C) + \eta 1(\bar{C}). \end{aligned}$$
(69)

2: Verification. Below \(\mathbb P\) is understood as the probability arising on the direct product of the probability spaces mentioned earlier. Let

$$ c:= \int _{A_1} (p^1(x)-p^2(x))\Lambda (dx) \equiv \int _{A_2} (p^2(x)-p^1(x))\Lambda (dx). $$

Due to our assumptions we have,

$$\begin{aligned}&\displaystyle c+ \kappa = \int _{A_1} (p^1(x)-p^2(x))\Lambda (dx) + \int \left( p^1(x)\wedge p^2(x)\right) \Lambda (dx) \\\\&\displaystyle = \int _{A_1} (p^1(x)-p^2(x))\Lambda (dx) + \int _{A_1} \left( p^1(x)\wedge p^2(x)\right) \Lambda (dx) + \int _{A_2} \left( p^1(x)\wedge p^2(x)\right) \Lambda (dx) \\\\&\displaystyle = \int _{A_1} p^1(x)\Lambda (dx) + \int _{A_2} p^1(x) \Lambda (dx) = \int _{A_1 \cup A_2} p^1(x)\Lambda (dx) = 1. \end{aligned}$$

So,

$$ c = 1-\kappa \in (0,1). $$

Also,

$$ p^\eta (x) = \frac{p^1 - p^1\wedge p^2}{c}(x). $$

Also notice that

$$ \mathbb P(C | \xi ^2) = \frac{p^1}{p^1\vee p^2}(\xi ^2), $$

and recall that on C, \(\xi ^3 = \xi ^2\), while on its complement \(\bar{C}\), \(\xi ^3 = \eta \). Now, for any bounded Borel measurable function g we have,

$$\begin{aligned}&\displaystyle \mathbb E g(\xi ^3) = \mathbb E g(\xi ^3) 1(C) + \mathbb E g(\xi ^3) 1(\bar{C}) = \mathbb E g(\xi ^2) 1(C) + \mathbb E g(\eta ) 1(\bar{C}) \\\\&\displaystyle = \mathbb E g(\xi ^2) \frac{p^1}{p^1\vee p^2}(\xi ^2) + \mathbb E g(\eta )(1-\frac{p^1}{p^1\vee p^2}(\xi ^2)) \\\\&\displaystyle = \mathbb E g(\xi ^2) \frac{p^1}{p^1\vee p^2}(\xi ^2) + \mathbb E g(\eta ) \mathbb E (1-\frac{p^1}{p^1\vee p^2}(\xi ^2)) \\\\&\displaystyle = \int \limits _{A_1 \cup A_2} g(x) \frac{p^1}{p^1\vee p^2}(x) p^2(x)\Lambda (dx) + \int \limits _{(A_1)} g(x) p^\eta (x)\Lambda (dx)\\&\qquad \times \int \limits _{(A_2)} (1-\frac{p^1}{p^1\vee p^2}(y))p^2(y)\Lambda (dy) \\\\&\displaystyle = \int \limits _{A_1} g(x) p^2(x)\Lambda (dx) + \int _{A_2} g(x) p^1(x)\Lambda (dx) + \int \limits _{A_1} g(x) \frac{p^1 - p^2}{c}(x)\Lambda (dx)\\&\qquad \times \int \limits _{A_2} (p^2 - p^1)(y)\Lambda (dy) \\\\&\displaystyle = \int \limits _{A_1 \cup A_2} g(x) p^1(x)\Lambda (dx) = \mathbb E g(\xi ^1). \end{aligned}$$

Here \((A_1)\) in brackets in \(\displaystyle \int \limits _{(A_1)} g(x) p^\eta (x)\Lambda (dx)\) is used with the following meaning: the integral is originally taken over the whole domain, but integration outside the set \(A_1\) gives zero; hence, only the integral over this domain remains. The established equality \(\mathbb E g(\xi ^3)= \mathbb E g(\xi ^1)\) means that \(\mathcal{L}(\xi ^3) = \mathcal{L}(\xi ^1)\), as required.

Finally, from the definition of \(\xi ^3\) it is straightforward that

$$ \mathbb P(\xi ^3 = \xi ^2) \ge \mathbb P(C). $$

So,

$$\begin{aligned} \mathbb P(\xi ^3 = \xi ^2) \ge \mathbb P(C) = \mathbb E\frac{p^1}{p^1\vee p^2}(\xi ^2) = \int \frac{p^1}{p^1\vee p^2}(x)p^2(x)\Lambda (dx) \\\\ = \int _{A_1} \frac{p^1}{p^1\vee p^2}(x)p^2(x)\Lambda (dx) + \int _{A_2} \frac{p^1}{p^1\vee p^2}(x)p^2(x)\Lambda (dx) \\\\ = \int _{A_1} p^2(x)\Lambda (dx) + \int _{A_2} p^1(x)\Lambda (dx) = \int (p^1 \wedge p^2)(x)\, \Lambda (dx) = \kappa . \end{aligned}$$

Let us argue why, actually,

$$ \mathbb P(\xi ^3 = \xi ^2) = \mathbb P(C) = \kappa , $$

i.e. why the inequality \(\mathbb P(\xi ^3 = \xi ^2) \ge \mathbb P(C)\) cannot be strict. Indeed, \(\mathbb P(\xi ^3 = \xi ^2) > \mathbb P(C)\) could only occur if \(\mathbb P(\eta 1(\bar{C}) = \xi ^2)>0\) (cf. (69)), or, equivalently, if \(\displaystyle \mathbb P\left( \eta 1(\frac{p^1}{p^1\vee p^2}(\xi ^2) < \zeta ) = \xi ^2\right) >0\). However,

$$ \omega \in \bar{C} = \{\omega : \; \frac{p^1}{p^1\vee p^2}(\xi ^2) < \zeta \} $$

implies \(p^1(\xi ^2)<p^2(\xi ^2)\), that is, \(\xi ^2 \in A_2\). But on this set the density of \(\eta \) equals zero. Hence, \(\mathbb P(\xi ^3 = \xi ^2) > \mathbb P(C)\) is impossible, which means that, in fact, we have proved that \(\mathbb P(\xi ^3 = \xi ^2) = \mathbb P(C) = \kappa \), as required. Lemma 43 is proved.

Here is another, ‘symmetric’ version of the latter lemma.

Lemma 44

(‘Of two random variables’) Let \(\xi ^{1}\) and \(\xi ^2\) be two random variables on their own probability spaces \((\Omega ^1, \mathcal{F}^1, \mathbb P^1)\) and \((\Omega ^2, \mathcal{F}^2, \mathbb P^2)\) (without loss of generality distinct; the variables become independent once we take the direct product of these spaces), with densities \(p^1\) and \(p^2\), respectively, with respect to some reference measure \(\Lambda \). If

$$\begin{aligned} \kappa := \int \left( p^1(x)\wedge p^2(x)\right) \Lambda (dx) > 0, \end{aligned}$$
(70)

there exists one more probability space \((\Omega , \mathcal{F}, \mathbb P)\) and two random variables \(\eta ^1, \eta ^2\) on it such that

$$ \begin{aligned} \mathcal{L}(\eta ^j) =\mathcal{L}(\xi ^j), \; j=1,2, \quad \& \quad \mathbb P(\eta ^1 = \eta ^2) = \kappa . \end{aligned}$$
(71)

Proof of Lemma 44. 1: Construction. We now need four new independent random variables: a Bernoulli random variable \(\gamma \) with \(\mathbb P(\gamma =0) = \kappa \), and \(\zeta ^{0}, \zeta ^1, \zeta ^2\) with the densities

$$\begin{aligned} \displaystyle p^{\zeta ^1}(x):= & {} \frac{p^1 - p^1\wedge p^2}{\displaystyle \int (p^1 - p^1\wedge p^2)(y)\Lambda (dy)}(x),\\ p^{\zeta ^2}(x):= & {} \frac{p^2 - p^1\wedge p^2}{\displaystyle \int (p^2 - p^1\wedge p^2)(y)\Lambda (dy)}(x), \\ p^{\zeta ^0}(x):= & {} \frac{ p^1\wedge p^2}{\displaystyle \int (p^1\wedge p^2)(y)\Lambda (dy)}(x). \end{aligned}$$

We may assume that they are all defined on their own probability spaces, and eventually we consider the direct product of these probability spaces, denoted by \((\Omega , \mathcal{F}, \mathbb P)\). As a result, they are all defined on a single probability space and are independent there. Now, on this product space, let

$$ \begin{aligned} \eta ^1:= \zeta ^0 1(\gamma =0) + \zeta ^1 1(\gamma \not =0), \quad \& \quad \eta ^2:= \zeta ^0 1(\gamma =0) + \zeta ^2 1(\gamma \not =0). \end{aligned}$$
(72)

We shall see that \(\eta ^{1,2}\) possess all the properties claimed in the Lemma.

2: Verification. From (72), clearly,

$$ \mathbb P(\eta ^1=\eta ^2) \ge \mathbb P(\gamma =0) = \kappa . $$

Yet, as we already saw earlier (in slightly different terms), this inequality is in fact an equality: \(\mathbb P(\eta ^1=\eta ^2) = \mathbb P(\gamma =0) = \kappa \).

Next, since \(\gamma \), \(\zeta ^0\) and \(\zeta ^{1}\) are independent on \((\Omega , \mathcal F, \mathbb P)\), for any bounded measurable function g we have,

$$\begin{aligned}&\displaystyle \mathbb E g(\eta ^1) = \mathbb E g(\eta ^1)1(\gamma =0) + \mathbb E g(\eta ^1)1(\gamma \not =0) \\\\&\displaystyle = \mathbb E g(\zeta ^0)1(\gamma =0) + \mathbb E g(\zeta ^1)1(\gamma \not =0) = \mathbb E g(\zeta ^0)\, \mathbb E 1(\gamma =0) + \mathbb E g(\zeta ^1)\, \mathbb E 1(\gamma \not =0) \\\\&\displaystyle = \kappa \int g(y) p^{\zeta ^0}(y)\,\Lambda (dy) +(1- \kappa ) \int g(y) p^{\zeta ^1}(y)\,\Lambda (dy) \\\\&\displaystyle = \kappa \int g(x) \frac{p^1\wedge p^2}{\displaystyle \int (p^1\wedge p^2)(y)\Lambda (dy)}(x)\, \Lambda (dx) + (1- \kappa ) \int g(x) \frac{p^1 - p^1\wedge p^2}{\displaystyle \int (p^1 - p^1\wedge p^2)(y)\Lambda (dy)}(x)\,\Lambda (dx) \\\\&\displaystyle = \int g(x)\, (p^1\wedge p^2)(x)\, \Lambda (dx) + \int g(x)\, (p^1 - p^1\wedge p^2)(x)\,\Lambda (dx) = \int g(y)\, p^1(y)\,\Lambda (dy) = \mathbb E g(\xi ^1). \end{aligned}$$

For \(\eta ^2\) the argument is similar, so also \(\mathbb E g(\eta ^2) = \mathbb E g(\xi ^2)\). Lemma 44 is proved.
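The two-random-variables coupling just proved lends itself to a direct numerical check. The following Python sketch is an illustration, not part of the proof: it assumes a hypothetical three-point state space with \(\Lambda \) the counting measure, and the distributions p1, p2 and all variable names are choices made for this example only. It samples \(\gamma \) and \(\zeta ^{0}, \zeta ^1, \zeta ^2\), builds \(\eta ^1, \eta ^2\) by formula (72), and checks empirically that the marginals coincide with \(p^1, p^2\) and that \(\mathbb P(\eta ^1 = \eta ^2) \approx \kappa \).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: two distributions on {0, 1, 2},
# reference measure Lambda = counting measure.
p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.2, 0.3, 0.5])

pmin = np.minimum(p1, p2)
kappa = pmin.sum()                       # kappa = integral of p1 ^ p2 = 0.7 here

# Densities of zeta^0, zeta^1, zeta^2 as in the proof of Lemma 44
p_z0 = pmin / kappa
p_z1 = (p1 - pmin) / (1.0 - kappa)
p_z2 = (p2 - pmin) / (1.0 - kappa)

n = 200_000
gamma = rng.random(n) < kappa            # event {gamma = 0}, with prob. kappa
z0 = rng.choice(3, size=n, p=p_z0)
z1 = rng.choice(3, size=n, p=p_z1)
z2 = rng.choice(3, size=n, p=p_z2)

eta1 = np.where(gamma, z0, z1)           # formula (72)
eta2 = np.where(gamma, z0, z2)

print((eta1 == eta2).mean())             # close to kappa = 0.7
print(np.bincount(eta1, minlength=3) / n)  # close to p1
print(np.bincount(eta2, minlength=3) / n)  # close to p2
```

Since the residual densities of \(\zeta ^1\) and \(\zeta ^2\) have disjoint supports, the event \(\{\eta ^1=\eta ^2\}\) coincides with \(\{\gamma =0\}\), which is why the empirical frequency matches \(\kappa \) rather than merely dominating it.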

Remark 45

Note that the extended probability space in the proof of Lemma 44 has the form,

$$ (\Omega , \mathcal{F}, \mathbb P) = (\Omega ^1, \mathcal{F}^1, \mathbb P^1) \times (\Omega ^2, \mathcal{F}^2, \mathbb P^2) \times (\Omega ^\gamma , \mathcal{F}^\gamma , \mathbb P^\gamma ) \times \prod _{k=0}^{2} (\Omega ^{\zeta _k}, \mathcal{F}^{\zeta _k}, \mathbb P^{\zeta _k}). $$

13 General Coupling Method for Markov Chains

Throughout this section, assumption (54) is in force. In this section it is explained how to apply the general coupling method of Sect. 12 to Markov chains in general state spaces \((S, \mathcal{S})\). Various presentations of this method may be found in [19, 29, 33, 40, 43], among others. This section follows the lines of [6], which, in turn, is based on [43]. Note that in [6] the state space was \(\mathbb R^1\); however, in \(\mathbb R^d\) all formulae remain the same. Clearly, this may be further extended to more general state spaces, although we will not pursue this goal here.

Let us generalize Lemma 44 to a sequence of random variables and present our coupling construction for Markov chains based on [43]. Assume that the process has a transition density \(p(x,y)\) with respect to some reference measure \(\Lambda \), and consider two versions \((X^1_n), (X^2_n)\) of the same Markov process with two initial distributions which also have densities with respect to this \(\Lambda \), denoted by \(p_{X^1_0}\) and \(p_{X^2_0}\) (of course, this does not exclude the case of non-random initial states). Let

$$\begin{aligned} \kappa (u,v):= \int p(u,t) \wedge p(v,t)\,\Lambda (dt), \quad \kappa = \inf \limits _{u,v}\kappa (u,v), \end{aligned}$$
(73)

and

$$\begin{aligned} \kappa (0) :=\int p_{X^1_0}(t) \wedge p_{X^2_0}(t)\,\Lambda (dt). \end{aligned}$$
(74)

It is clear that \(0\le \kappa (u,v)\le 1\) for all u, v. Note that \(\kappa (0)\) is not the same as \(\kappa _0\) in the previous sections. We assume that \(X^1_0\) and \(X^2_0\) have different distributions, so \(\kappa (0)<1\); otherwise we obviously have \(X^1_n{\mathop {=}\limits ^{d}}X^2_n\) (equality in distribution) for all n, and the coupling can be made trivially, for example, by letting \(\widetilde{X}^1_n= \widetilde{X}^2_n:=X^1_n\).
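For a finite state space with \(\Lambda \) the counting measure, the quantities (73) and (74) are elementary to compute. Below is a minimal Python sketch; the 3×3 transition matrix P and the two initial densities are hypothetical choices made purely for illustration.

```python
import numpy as np

# Hypothetical finite-state example, Lambda = counting measure:
# p(u, .) is row u of the transition matrix P.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

# kappa(u, v) = sum_t p(u,t) ^ p(v,t), formula (73), for all pairs (u, v)
kappa_uv = np.minimum(P[:, None, :], P[None, :, :]).sum(axis=2)
kappa = kappa_uv.min()                  # kappa = inf over (u, v); 0.5 here

# kappa(0) for two initial densities, formula (74)
p10 = np.array([1.0, 0.0, 0.0])         # X^1_0 = state 0
p20 = np.array([0.0, 0.0, 1.0])         # X^2_0 = state 2
kappa0 = np.minimum(p10, p20).sum()     # 0.0: the initial laws are disjoint

print(kappa_uv)                         # diagonal entries equal 1
print(kappa, kappa0)
```

Note that the diagonal of kappa_uv is identically one, so the infimum in (73) is attained at some pair of distinct states.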

Let us introduce a new, vector-valued Markov process \(\left( \eta ^1_n,\eta ^2_n,\xi _n,\zeta _n\right) \). If \(\kappa (0)=0\) then we set

$$\begin{aligned} \eta ^1_0:=X^1_0,\; \eta ^2_0:=X^2_0,\; \xi _0:=0,\; \zeta _0:=1. \end{aligned}$$

Otherwise, if \(0<\kappa (0)<1\), then we apply Lemma 44 to the random variables \(X^1_0\) and \(X^2_0\) so as to create the random variables \(\eta ^1_0\), \(\eta ^2_0\), \(\xi _0\) and \(\zeta _0\) (they correspond to \(\eta ^1, \eta ^2, \xi \), and \(\zeta \) in the Lemma). Now, assuming that the random variables \(\left( \eta ^1_n,\eta ^2_n,\xi _n,\zeta _n\right) \) have been determined for some n, let us show how to construct them for \(n+1\). To this end, we define the transition probability density \(\varphi \) with respect to the same measure \(\Lambda \) for this (vector-valued) process as follows,

$$\begin{aligned} \varphi (x,y):=\varphi _1(x,y^1)\varphi _2(x,y^2)\varphi _3(x,y^3) \varphi _4(x,y^4), \end{aligned}$$
(75)

where \(x=(x^1,x^2,x^3,x^4)\), \(y=(y^1,y^2,y^3,y^4)\), and if \(0<\kappa (x^1,x^2)<1\), then

$$\begin{aligned}&\displaystyle \varphi _1(x,u):=\frac{p(x^1,u)-p(x^1, u)\wedge p(x^2,u)}{1-\kappa (x^1,x^2)}, \quad \varphi _2(x,u):=\frac{p(x^2,u)-p(x^1,u)\wedge p(x^2,u)}{1-\kappa (x^1,x^2)}, \nonumber \\\nonumber \\\nonumber&\displaystyle \varphi _3(x,u):=1(x^4=1)\frac{p(x^1,u)\wedge p(x^2,u)}{\kappa (x^1,x^2)}+1(x^4=0)p(x^3,u), \\\nonumber \\&\displaystyle \varphi _4(x,u):=1(x^4=1)\left( \delta _1(u)(1-\kappa (x^1,x^2))+ \delta _0(u)\kappa (x^1,x^2)\right) +1(x^4=0)\delta _0(u), \end{aligned}$$
(76)

where \(\delta _i(u)\) is the Kronecker symbol, \(\delta _i(u) = 1(u=i)\), or, in other words, the delta measure concentrated at state i. The case \(x^4=0\) signifies that coupling was already realized at a previous step, while \(u=0\) means successful coupling at the current transition. In the degenerate cases: if \(\kappa (x^1,x^2)=0\) (coupling at this transition is impossible), then we may set, e.g.

$$ \varphi _3(x,u):=1(x^4=1)1(0<u<1) + 1(x^4=0)p(x^3,u), $$

and if \(\kappa (x^1,x^2)=1\), then we may set

$$ \varphi _1(x,u)=\varphi _2(x,u):=1(0<u<1). $$

In fact, in both degenerate cases \(\kappa (x^1,x^2)=0\) or \(\kappa (x^1,x^2)=1\), the functions \(\varphi _3(x,u)1(x^4=1)\) (or, respectively, \(\varphi _1(x,u)\) and \(\varphi _2(x,u)\)) can be defined more or less arbitrarily, only so as to keep the property of conditional independence of the four random variables \(\left( \eta ^1_{n+1},\eta ^2_{n+1},\xi _{n+1},\zeta _{n+1}\right) \) given \(\left( \eta ^1_n,\eta ^2_n,\xi _n,\zeta _n\right) \).

Lemma 46

Let the random variables \(\widetilde{X}^1_n\) and \(\widetilde{X}^2_n\), \(n\in \mathbb {Z}_+\), be defined by the following formulae:

$$\begin{aligned} \widetilde{X}^1_n:=\eta ^1_n 1(\zeta _n=1)+\xi _n 1(\zeta _n=0), \quad \widetilde{X}^2_n:=\eta ^2_n 1(\zeta _n=1)+\xi _n 1(\zeta _n=0). \end{aligned}$$

Then

$$ {\widetilde{X}}^1_n{\mathop {=}\limits ^{d}}X^1_n, \,\, {\widetilde{X}}^{2}_{n}{\mathop {=}\limits ^{d}}X^{2}_{n}, \quad \text{ for } \text{ all } n\ge 0. $$

Moreover,

$$ \widetilde{X}^1_n=\widetilde{X}^2_n, \quad \forall \; n\ge n_0(\omega ):=\inf \{k\ge 0: \zeta _k=0\}, $$

and

$$\begin{aligned} \mathbb {P}(\widetilde{X}^1_n\ne \widetilde{X}^2_n)\le (1-\kappa (0))\, \mathbb {E}\prod _{i=0}^{n-1} (1-\kappa (\eta ^1_i,\eta ^2_i)). \end{aligned}$$
(77)

Moreover, \(\left( \widetilde{X}^1_n\right) _{n\ge 0}\) and \(\left( \widetilde{X}^2_n\right) _{n\ge 0}\) are both homogeneous Markov processes, and

$$ \begin{aligned} \left( \widetilde{X}^1_n\right) _{n\ge 0}{\mathop {=}\limits ^{d}}\left( X^1_n\right) _{n\ge 0}, \quad \& \quad \left( \widetilde{X}^2_n\right) _{n\ge 0}{\mathop {=}\limits ^{d}}\left( X^2_n\right) _{n\ge 0}. \end{aligned}$$
(78)

Informally, the processes \(\eta ^1_n\) and \(\eta ^2_n\) represent \(X^1_n\) and \(X^2_n\), respectively, under the condition that coupling was not successful until time n, while the process \(\xi _n\) represents both \(X^1_n\) and \(X^2_n\) if the coupling occurs no later than at time n. The process \(\zeta _n\) indicates the moment of coupling: the event \(\zeta _n=0\) is equivalent to the event that coupling occurs no later than at time n. As follows from (75) and (76),

$$\begin{aligned}&\displaystyle \mathbb P(\zeta _{n+1}=0|\zeta _n=0)=1, \\ \\&\displaystyle \mathbb P(\zeta _{n+1}=0|\zeta _n=1,\eta ^1_n=x^1,\eta ^2_n=x^2)=\kappa (x^1,x^2). \end{aligned}$$

Hence, if two processes were coupled at time n, then they remain coupled at time \(n+1\), and if they were not coupled, then the coupling occurs with the probability \(\kappa (\eta ^1_n,\eta ^2_n)\). At each time the probability of coupling at the next step is as large as possible, given the current states.

For the proof of Lemma 46 see [6].
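Although the proof is delegated to [6], the mechanism of Lemma 46 is easy to simulate. The sketch below is a simplification under stated assumptions: a finite state space, the illustrative transition matrix P (a hypothetical choice), and tracking only the two coupled components rather than the full vector process. At each step, if the components differ, a joint jump is made with probability \(\kappa (x^1,x^2)\) using the normalized minimum density, and otherwise the components jump independently with the residual densities, mirroring (76); the empirical probability of being uncoupled at time n is then compared with the bound \((1-\kappa )^n\) implied by (77).

```python
import numpy as np

rng = np.random.default_rng(1)

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
kappa_uv = np.minimum(P[:, None, :], P[None, :, :]).sum(axis=2)

def coupled_step(x1, x2):
    """One transition of the coupling: with probability kappa(x1, x2) both
    components jump together (normalized minimum density); otherwise they
    jump independently with the normalized residual densities."""
    if x1 == x2:                        # already coupled: move together forever
        y = rng.choice(3, p=P[x1])
        return y, y
    pmin = np.minimum(P[x1], P[x2])
    k = pmin.sum()
    if rng.random() < k:                # successful coupling at this step
        y = rng.choice(3, p=pmin / k)
        return y, y
    y1 = rng.choice(3, p=(P[x1] - pmin) / (1 - k))
    y2 = rng.choice(3, p=(P[x2] - pmin) / (1 - k))
    return y1, y2                       # residual supports are disjoint: y1 != y2

n, trials, uncoupled = 5, 20_000, 0
for _ in range(trials):
    x1, x2 = 0, 2
    for _ in range(n):
        x1, x2 = coupled_step(x1, x2)
    uncoupled += (x1 != x2)

kappa = kappa_uv.min()
print(uncoupled / trials, (1 - kappa) ** n)   # empirical rate vs the bound
```

The empirical frequency stays below \((1-\kappa )^n\), and is typically noticeably smaller, since along most paths \(\kappa (\eta ^1_i,\eta ^2_i) > \kappa \); this is precisely the slack that the operator V of the next paragraphs quantifies.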

From the last lemma a new version of the exponential bound in the Ergodic Theorem may be derived. In general, it may improve the estimate based on the constant \(\kappa \) from the Markov–Dobrushin condition (52) or (54). In the remaining paragraphs we do not pursue the most general situation, restricting ourselves again to the simple setting \(|S|<\infty \). Introduce the operator V acting on a (bounded continuous) function h on the space \(S \times S\) as follows: for \(x=(x^1, x^2)\in S \times S\) and \(X_n := (\tilde{X}^1_n, \tilde{X}^2_n)\),

$$\begin{aligned} Vh(x) := (1-\kappa (x^1,x^2)) \mathbb E_{x^1,x^2}h(X_1) \equiv \exp (\psi (x))\mathbb E_{x^1,x^2}h(X_1), \end{aligned}$$
(79)

where in the last expression \(\psi (x):= \ln (1-\kappa (x^1,x^2))\). The aim is now to find out whether the geometric bound \((1-\kappa )^n\) in (5) under the assumption (3) is optimal, or whether it can be further improved, at least under some additional assumptions. Let us rewrite the estimate (77) as follows:

$$\begin{aligned} \mathbb {P}(\widetilde{X}^1_n\ne \widetilde{X}^2_n)\le (1-\kappa (0)) V^n 1(x). \end{aligned}$$
(80)

Note that by definition (79), for the non-negative matrix V its sup-norm \(\Vert V\Vert = \Vert V\Vert _{B,B}:=\sup \limits _{|h|_B\le 1} |Vh|_B \) equals \(\sup \limits _{x} V1(x)\), where \(|h|_B := \max \limits _x |h(x)|\) and \(1=1(x)\) is considered as a function on \(S\times S\) identically equal to one. Note that \(\sup \limits _{x} V1(x)=1-\kappa \).

Now the well-known inequality (see, for example, [25, §8]) reads,

$$\begin{aligned} r(V) \le \Vert V\Vert = (1-\kappa ). \end{aligned}$$
(81)

Further, from the Perron–Frobenius theorem it follows (see, e.g. [14, (7.4.10)]) that

$$\begin{aligned} \lim _n \frac{1}{n} \, \ln V^n 1(x) = \ln r(V). \end{aligned}$$
(82)

The assertions (80) and (82) together lead to the following result.

Theorem 47

Let the state space S be finite and let the Markov condition (3) be satisfied. Then

$$\begin{aligned} \limsup \limits _{n\rightarrow \infty } \frac{1}{n} \ln \Vert P_x(n,\cdot ) - \mu (\cdot )\Vert _{TV} \le \ln r(V). \end{aligned}$$
(83)

In other words, for any \(\epsilon >0\) and n large enough,

$$\begin{aligned} \Vert P_x(n,\cdot ) - \mu (\cdot )\Vert _{TV} \le (r(V)+ \epsilon )^n, \end{aligned}$$
(84)

which is strictly better than (5) provided that \(r(V)<\Vert V\Vert = 1-\kappa \) and \(\epsilon >0\) is chosen small enough, i.e. so that \(r(V)+ \epsilon < 1-\kappa \). A similar improvement is likely to hold in more general situations for compact operators V with \(r(V) < 1-\kappa \); however, the problem in full generality remains open.
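To see the gap between \(r(V)\) and \(\Vert V\Vert \) concretely, one may assemble the matrix of V for a toy chain. In the sketch below (an illustration under stated assumptions, not the general construction) the expectation in (79) is taken along the not-yet-coupled components, i.e. with the residual densities \(\varphi _1, \varphi _2\) from (76), which is what makes \(V^n 1(x)\) equal \(\mathbb E \prod _{i}(1-\kappa (\eta ^1_i,\eta ^2_i))\); the three-state transition matrix P is a hypothetical choice for the example.

```python
import numpy as np

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
S = len(P)

# Matrix of V on S x S, using the residual (not-yet-coupled) dynamics:
# V[(x1,x2), (y1,y2)] = (1 - kappa(x1,x2)) * phi_1(x, y1) * phi_2(x, y2).
V = np.zeros((S * S, S * S))
for x1 in range(S):
    for x2 in range(S):
        pmin = np.minimum(P[x1], P[x2])
        k = pmin.sum()
        if x1 == x2 or k >= 1.0:        # kappa = 1: factor (1 - kappa) kills the row
            continue
        phi1 = (P[x1] - pmin) / (1 - k)
        phi2 = (P[x2] - pmin) / (1 - k)
        V[x1 * S + x2] = (1 - k) * np.outer(phi1, phi2).ravel()

norm_V = V.sum(axis=1).max()            # ||V|| = sup_x V1(x) = 1 - kappa
r_V = np.abs(np.linalg.eigvals(V)).max()  # spectral radius r(V)

print(norm_V, r_V)                      # here r(V) is strictly below 1 - kappa
```

For this chain \(\Vert V\Vert = 1-\kappa = 0.5\), while the spectral radius comes out around 0.47, so the asymptotic rate in (84) is indeed strictly better than \((1-\kappa )^n\) in this example.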