
5.1 Laws of Large Numbers

Suppose we have a sequence of trials in each of which a certain event A can occur with probability p independently of the outcomes of the other trials. Form a sequence of random variables as follows. Put $\xi_k = 1$ if the event A has occurred in the k-th trial, and $\xi_k = 0$ otherwise. Then \((\xi_{k} )_{k=1}^{\infty}\) will be a sequence of independent random variables which are identically distributed according to the Bernoulli law: $\mathbf{P}(\xi_k =1)=p$, $\mathbf{P}(\xi_k =0)=q=1-p$, $\mathbf{E}\xi_k =p$, \(\operatorname{Var}(\xi_{k} ) =pq\). The sum \(S_{n} =\xi_{1} +\cdots+ \xi_{n} \mathbin {{\subset }\hspace {-.7em}{=}}{\bf B}_{p}^{n}\) is simply the number of occurrences of the event A in the first n trials. Clearly $\mathbf{E}S_n =np$ and \(\operatorname{Var}(S_{n} )=npq\).

The following assertion is called the law of large numbers for the Bernoulli scheme.

Theorem 5.1.1

For any ε>0

$$\mathbf{P} \biggl( \biggl| \frac{S_n}{n}-p \biggr| >\varepsilon \biggr) \to0 \quad \mbox{\textit{as} } n \to\infty. $$

This assertion is a direct consequence of Theorem 4.7.5. One can also obtain the following stronger result:

Theorem 5.1.2

(The Strong Law of Large Numbers for the Bernoulli Scheme)

For any ε>0, as n→∞,

$$\mathbf{P} \biggl( \,\sup_{k \ge n} \biggl| \frac{S_k}{k} -p \biggr| > \varepsilon \biggr) \to0. $$

The interpretation of this result is that the notion of probability which we introduced in Chaps. 1 and 2 corresponds to the intuitive interpretation of probability as the limiting value of the relative frequency of the occurrence of the event. Indeed, $S_n/n$ can be considered as the relative frequency of the event A for which $\mathbf{P}(A)=p$, and, by the above theorems, $S_n/n$ converges to p in a certain sense.
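This convergence is easy to observe empirically. The following minimal sketch (in Python; the success probability p = 0.3, the seed and the checkpoints are arbitrary illustrative choices, not from the text) simulates a Bernoulli scheme and prints the relative frequency $S_n/n$:

```python
import random

# Simulate a Bernoulli scheme and track the relative frequency S_n / n.
p = 0.3
random.seed(1)

s = 0  # running sum S_n
for n in range(1, 100_001):
    s += 1 if random.random() < p else 0
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n = {n:>6}:  S_n/n = {s / n:.4f}")
# The printed frequencies settle near p = 0.3, as Theorems 5.1.1-5.1.2 predict.
```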

Proof of Theorem 5.1.2

One has

$$\begin{aligned} \mathbf{P} \biggl( \,\sup_{k \ge n} \biggl| \frac{S_k}{k} -p \biggr| > \varepsilon \biggr) =&\mathbf{P} \Biggl( \,\bigcup_{k=n}^{\infty} \biggl\{ \biggl| \frac{S_k}{k} -p \biggr| >\varepsilon \biggr\} \Biggr) \\\le&\sum_{k=n}^{\infty} \mathbf{P} \biggl( \biggl| \frac{S_k}{k}-p \biggr| >\varepsilon \biggr) \le\sum_{k=n}^{\infty} \frac{\mathbf{E}(S_k -kp)^4}{k^4 \varepsilon^4 }. \\\end{aligned}$$
(5.1.1)

Here we again made use of Chebyshev’s inequality but this time for the fourth moments. Expanding we find that

$$\begin{aligned} \mathbf{E}(S_k -kp)^4 =&\mathbf{E} \Biggl( \,\sum _{j=1}^{k} (\xi_j -p) \Biggr)^4 = \sum_{j=1}^{k} \mathbf{E}(\xi_j -p)^4 + 6\sum _{i<j} \mathbf{E}(\xi_i -p)^2( \xi_j -p)^2 \\=&k\bigl(pq^4 +qp^4\bigr) +3k(k-1) (pq)^2 \le k+k(k-1)=k^2. \end{aligned}$$
(5.1.2)

Thus the probability we want to estimate does not exceed the sum

$$\begin{aligned} \varepsilon^{-4} \sum_{k=n}^{\infty} k^{-2} \to0\quad\mbox {as } n \to \infty. \end{aligned}$$

 □

It is not hard to see that we would not have obtained the required bound had we used Chebyshev's inequality with second moments in (5.1.1): that would only give $\mathbf{P}(|S_k/k - p|>\varepsilon) \le pq/(k\varepsilon^2)$, and the series $\sum k^{-1}$ diverges.

We could also note that one actually has much stronger bounds for $\mathbf{P}(|S_k -kp|>\varepsilon k)$ than those that we made use of above. These will be derived in Sect. 5.5.

Corollary 5.1.1

If f(x) is a continuous function on [0,1] then, as n→∞,

$$ \mathbf{E}f \biggl(\frac{S_n}{n} \biggr) \to f(p) $$
(5.1.3)

uniformly in p.

Proof

For any ε>0,

$$\begin{aligned}[c] \mathbf{E} \biggl| f \biggl( \frac{S_n}{n} \biggr) -f(p) \biggr| \le& \mathbf{E} \biggl( \biggl| f \biggl( \frac{S_n}{n} \biggr) -f(p) \biggr|; \biggl| \frac{S_n}{n} -p \biggr| \le\varepsilon \biggr) \\ &{} + \mathbf{E} \biggl( \biggl| f \biggl( \frac{S_n}{n} \biggr) -f(p) \biggr|; \biggl| \frac{S_n}{n} -p \biggr| > \varepsilon \biggr) \\ \le& \sup_{|x| \le\varepsilon} \bigl|f(p+x) -f(p)\bigr| + \delta _n( \varepsilon), \end{aligned} $$

where, by virtue of (5.1.1) and (5.1.2), the quantity $\delta_n(\varepsilon)$ is independent of p and $\delta_n(\varepsilon)\to 0$ as n→∞. Since f is uniformly continuous on [0,1], the first term on the right-hand side can also be made arbitrarily small by the choice of ε, which proves (5.1.3). □

Corollary 5.1.2

If f(x) is continuous on [0,1], then, as n→∞,

$$\sum_{k=0}^{n} f \biggl( \frac{k}{n} \biggr){n\choose k} x^k (1-x)^{n-k} \to f(x) $$

uniformly in x∈[0,1].

This relation is just a different form of (5.1.3) since

$$\mathbf{P}(S_n =k) =\binom{n}{ k}p^k (1-p)^{n-k} $$

(see Chap. 1). This relation implies the well-known Weierstrass theorem on approximation of continuous functions by polynomials. Moreover, the required polynomials are given here explicitly—they are Bernstein polynomials.
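For the reader who wishes to see Corollary 5.1.2 at work numerically, here is a minimal sketch (in Python; the test function f(x) = |x − 1/2| and the evaluation grid are arbitrary illustrative choices):

```python
from math import comb

def bernstein(f, n, x):
    """Value at x of the n-th Bernstein polynomial of f on [0, 1]."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)    # continuous but not smooth at x = 1/2
for n in (10, 100, 1000):
    # crude estimate of the uniform error sup |B_n f - f| over a grid
    err = max(abs(bernstein(f, n, i / 200) - f(i / 200)) for i in range(201))
    print(f"n = {n:>4}:  max error ≈ {err:.4f}")
```

The printed errors decrease with n, in agreement with the uniform convergence asserted above.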

5.2 The Local Limit Theorem and Its Refinements

5.2.1 The Local Limit Theorem

We know that \(\mathbf{P}(S_{n} =k)= \binom{n}{ k} p^{k} q^{n-k}\), q=1−p. However, this formula becomes very inconvenient for computations with large n and k, which raises the question about the asymptotic behaviour of the probability P(S n =k) as n→∞.

In the sequel, we will write $a_n \sim b_n$ for two number sequences $\{a_n\}$ and $\{b_n\}$ if $a_n/b_n \to 1$ as n→∞. Such sequences $\{a_n\}$ and $\{b_n\}$ will be said to be equivalent.

Set

$$ H(x) =x \ln{\frac{x}{p}} + (1-x) \ln{\frac{1-x}{1-p}}, \qquad p^* = \frac{k}{n}. $$
(5.2.1)

Theorem 5.2.1

As $k\to\infty$ and $n-k\to\infty$,

$$ \mathbf{P}(S_n =k) =\mathbf{P} \biggl( \frac{S_n}{n} =p^* \biggr) \sim \frac{1}{\sqrt{2\pi np^* (1-p^* )}} \exp\bigl\{ {-}nH \bigl(p^* \bigr) \bigr\}. $$
(5.2.2)

Proof

We will make use of Stirling’s formula according to which \(n! \sim\sqrt{2\pi n} n^{n} e^{-n}\) as n→∞. One has

$$\begin{aligned} \mathbf{P}(S_n = k) =& \binom{n}{k}p^k q^{n-k} \sim \sqrt{\frac{n}{2\pi k(n-k)}} \frac{n^n}{k^k (n-k)^{n-k}} p^k (1-p)^{n-k} \\=& \frac{1}{\sqrt{2\pi np^* (1-p^* )}}\\&{}\times \exp \biggl\{ -k \ln\frac{k}{n} -(n-k)\ln \frac{n-k}{n} +k\ln p +(n-k) \ln{(1-p)} \biggr\} \\=& \frac{1}{\sqrt{2\pi np^* (1-p^* )}} \exp \bigl\{ {-}n \bigl[p^* \ln{p^*}+\bigl(1-p^*\bigr) \ln\bigl(1-p^*\bigr) \\&{}-p^* \ln p-\bigl(1-p^*\bigr)\ln(1-p)\bigr] \bigr\} \\=&\frac{1}{\sqrt{2\pi np^* (1-p^* )}} \exp \bigl\{ {-}nH\bigl(p^* \bigr) \bigr\}. \end{aligned}$$

 □

If $p^* = k/n$ is close to p, then one can find another form for the right-hand side of (5.2.2) which is of significant interest. Note that the function H(x) is analytic on the interval (0,1). Since

$$ H'(x) =\ln\frac{x}{p}-\ln \frac{1-x}{1-p},\qquad H''(x) =\frac{1}{x} + \frac{1}{1-x}, $$
(5.2.3)

one has H(p)=H′(p)=0 and, as $p^* - p \to 0$,

$$H\bigl(p^* \bigr)= \frac{1}{2} \biggl( \frac{1}{p} + \frac{1}{q} \biggr) \bigl(p^* -p\bigr)^2 +O\bigl(\bigl|p^* -p \bigr|^3\bigr). $$

Therefore if $p^* \to p$ and $n(p^* - p)^3 \to 0$ then

$$\mathbf{P}(S_n =k) \sim\frac{1}{\sqrt{2\pi pq}} \exp \biggl\{ {-} \frac{n}{2pq} \bigl(p^* -p\bigr)^2 \biggr\} . $$

Putting

$$\varDelta=\frac{1}{\sqrt{npq}},\qquad\varphi(x)=\frac{1}{\sqrt{2\pi}} e^{-{x^2}/2}, $$

one obtains the following assertion.

Corollary 5.2.1

If $z = n(p^* - p) = k - np = o(n^{2/3})$ then

$$ \mathbf{P}(S_n =k)=\mathbf{P}(S_n -np=z) \sim\varphi(z\varDelta )\varDelta, $$
(5.2.4)

where $\varphi = \varphi_{0,1}(x)$ is the density of the normal distribution with parameters (0,1).

This formula also enables one to estimate the probabilities of the events of the form {S n <k}.

If $p^*$ differs substantially from p, then one can estimate the probabilities of such events using the results of Sect. 1.3.
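A small numerical experiment shows how closely (5.2.2) and (5.2.4) track the exact probabilities. The sketch below (Python; n, p and the values of k are arbitrary illustrative choices) prints all three quantities side by side:

```python
from math import comb, exp, log, pi, sqrt

n, p = 1000, 0.3
q = 1 - p

def H(x):
    """The function H from (5.2.1)."""
    return x * log(x / p) + (1 - x) * log((1 - x) / q)

for k in (280, 300, 320, 360):
    exact = comb(n, k) * p**k * q**(n - k)
    ps = k / n
    llt = exp(-n * H(ps)) / sqrt(2 * pi * n * ps * (1 - ps))          # (5.2.2)
    z = k - n * p
    normal = exp(-z**2 / (2 * n * p * q)) / sqrt(2 * pi * n * p * q)  # (5.2.4)
    print(f"k={k}: exact={exact:.3e}  (5.2.2)={llt:.3e}  (5.2.4)={normal:.3e}")
```

For k far from np (here k = 360, where z is of order n^{2/3}) the normal form (5.2.4) visibly deteriorates while (5.2.2) remains accurate.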

Example 5.2.1

In a jury consisting of an odd number n=2m+1 of persons, each member makes a correct decision with probability p=0.7 independently of the other members. What is the minimum number of members for which the verdict rendered by the majority of jury members will be correct with a probability of at least 0.99?

Put $\xi_k = 1$ if the k-th jury member made a correct decision and $\xi_k = 0$ otherwise. We are looking for odd numbers n for which $\mathbf{P}(S_n \le m) \le 0.01$. It is evident that such a trustworthy decision can be achieved only for large values of n. In that case, as we established in Sect. 1.3, the probability $\mathbf{P}(S_n \le m)$ is approximately equal to

$$\frac{(n+1-m)p}{(n+1)p-m} \mathbf{P}(S_n =m) \approx\frac{p}{2p-1} \mathbf{P}(S_n =m) . $$

Using Theorem 5.2.1 and the fact that in our problem

$$p^* \approx\frac{1}{2},\qquad H \biggl(\frac{1}{2} \biggr) =- \frac{1}{2} \ln4p(1-p),\qquad H' \biggl( \frac{1}{2} \biggr)= \ln \biggl(\frac{1-p}{p} \biggr), $$

we get

$$\begin{aligned} \mathbf{P}(S_n \le m) \approx&\frac{p}{2p-1} \sqrt{ \frac{2}{\pi n}} \exp \biggl\{ {-}nH \biggl( \frac{1}{2} - \frac{1}{2n} \biggr) \biggr\} \\\approx&\frac{p}{2p-1} \sqrt{\frac{2}{\pi n}} \exp \biggl\{ {-}nH \biggl( \frac{1}{2} \biggr)+ \frac{1}{2} H' \biggl( \frac{1}{2} \biggr) \biggr\} \\\approx&\frac{\sqrt{2p(1-p)}}{(2p-1)\sqrt{\pi n}} \bigl(\sqrt{4p(1-p)}\bigr)^n \approx0.915 \frac{1}{\sqrt{n}} (0.84)^{n/2}. \end{aligned}$$

On the right-hand side there is a monotonically decreasing function a(n). Solving the equation a(n)=0.01 we get the answer n=33. The same result will be obtained if one makes use of the explicit formulae.
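The decreasing function a(n) is easy to tabulate; the following sketch (Python) locates the first odd n with a(n) ≤ 0.01:

```python
from math import sqrt

def a(n):
    """The approximation a(n) to P(S_n <= m) derived above (p = 0.7)."""
    return 0.915 / sqrt(n) * 0.84 ** (n / 2)

n = 1
while a(n) > 0.01:
    n += 2           # jury sizes are odd
print(n, a(n))       # prints 33 and a value just below 0.01
```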

5.2.2 Refinements of the Local Theorem

It is not hard to bound the error of approximation (5.2.2). If, in Stirling's formula \(n! =\sqrt{2\pi n}n^{n} e^{-n+\theta(n)}\), we make use of the well-known inequalities

$$\frac{1}{12n +1} < \theta(n) < \frac{1}{12n}, $$

then the same argument will give the following refinement of Theorem 5.2.1.

Theorem 5.2.2

$$ \mathbf{P}(S_n =k)=\frac{1}{\sqrt{2\pi np^* (1-p^* )}} \exp\bigl\{ {-}nH \bigl(p^* \bigr) +\theta(k,n) \bigr\}, $$
(5.2.5)

where

$$ \bigl|\theta(k,n) \bigr| = \bigl|\theta(n) -\theta(k)-\theta(n-k) \bigr| < \frac{1}{12k}+\frac{1}{12(n-k)}= \frac{1}{12 np^* (1-p^* )}. $$
(5.2.6)

Relation (5.2.4) could also be refined as follows.

Theorem 5.2.3

For all k such that \(|p^{*} -p| \le\frac{1}{2} \min(p,q)\) one has

$$\mathbf{P}(S_n =k)=\varphi(z {\varDelta})\varDelta\bigl(1+\varepsilon(k,n) \bigr), $$

where

$$1+\varepsilon(k,n) =\exp \biggl\{ \vartheta \biggl( \frac{|z|^3 \varDelta^4}{3}+ \biggl(|z|+\frac{1}{6} \biggr)\varDelta^2 \biggr) \biggr\}, \quad |\vartheta|<1. $$

As one can easily see from the properties of the Taylor expansion of the function $e^x$, the order of magnitude of the term ε(k,n) in the above formulae coincides with that of the argument of the exponential. Hence it follows from Theorem 5.2.3 that, for $z = k - np = o(\varDelta^{-4/3})$ or, which is the same, $z = o(n^{2/3})$, one still has (5.2.4).

Proof

We will make use of Theorem 5.2.2. In addition to formulae (5.2.3), one can write the Taylor expansion

$$H\bigl(p^* \bigr)= \frac{(p^* -p)^2}{2} \biggl( \frac{1}{p}+ \frac{1}{q} \biggr) +R_1 = \frac{(p^* -p)^2}{2pq} +R_1 , $$

where we can estimate the residual \(R_{1} =\sum_{k=3}^{\infty} \frac{H^{(k)} (p)}{k!} (p^{*} -p)^{k}\). Taking into account that

$$\bigl|H^{(k)} (p)\bigr| \le(k-2)! \biggl( \frac{1}{p^{k-1}}+ \frac{1}{q^{k-1}} \biggr) ,\quad k\ge2, $$

and letting for brevity $|p^* - p| = \delta$, we get for \(\delta\le \frac{1}{2}\min(p,q)\) the bounds

$$\begin{aligned} |R_1| \le& \sum_{k=3}^{\infty} \frac{(k-2)!}{k!}\,\delta^k \biggl( \frac{1}{p^{k-1}}+\frac{1}{q^{k-1}} \biggr) \le \frac{\delta^3}{6} \biggl( \frac{1}{p^2}\frac{1}{1-\frac{\delta}{p}} + \frac{1}{q^2}\frac{1}{1-\frac{\delta}{q}} \biggr) \\\le& \frac{\delta^3}{6} \biggl( \frac{2}{p^2} + \frac{2}{q^2} \biggr) <\frac{\delta^3}{3(pq)^2}. \end{aligned}$$

From this it follows that

$$ -nH\bigl(p^* \bigr)=-\frac{(k-np)^2}{2npq} +\frac{\vartheta_1 |k-np|^3}{3(npq)^2} =- \frac{z^2 \varDelta^2 }{2}+\frac{\vartheta_1|z|^3 \varDelta^4 }{3}, \quad|\vartheta_1 |<1. $$
(5.2.7)

We now turn to the other factors in equality (5.2.5) and consider the product $p^*(1-p^*)$. Since $-p \le 1-p-p^* \le 1-p$, we have

$$\bigl|p^* \bigl(1-p^* \bigr)-p(1-p ) \bigr|= \bigl|\bigl(p-p^* \bigr) \bigl(1-p-p^* \bigr) \bigr| \le\bigl|p^* -p\bigr|\max(p,q) . $$

This implies in particular that, for \(|p^{*} -p| < \frac{1}{2} \min(p,q)\), one has

$$\bigl|p^* \bigl(1-p^* \bigr)-pq \bigr| < \frac{1}{2} pq, \qquad p^* \bigl(1-p^* \bigr) > \frac{1}{2}pq. $$

Therefore one can write along with (5.2.6) that, for the values of k indicated in Theorem 5.2.3,

$$ \bigl|\theta(k ,n)\bigr| < \frac{1}{6npq} =\frac{\varDelta^2}{6}. $$
(5.2.8)

It remains to consider the factor $[p^*(1-p^*)]^{-1/2}$. Since for |γ|<1/2

$$\bigl| \ln(1+\gamma)\bigr| =\biggl \vert \int_1^{1+\gamma} \frac{1}{x}dx \biggr \vert < 2|\gamma|, $$

one has for $\delta = |p^* - p| < \frac{1}{2}\min(p,q)$ the relations

$$\begin{aligned} \begin{aligned}[c] \ln\bigl(p^* \bigl(1-p^* \bigr)\bigr) =& \ln pq + \ln \biggl( 1+ \frac{p^* (1-p^* )-pq}{pq} \biggr) \\=&\ln(pq) +\ln \biggl( 1-\frac{\vartheta^* \delta}{pq} \biggr) , \quad\bigl|\vartheta^* \bigr|< \max(p, q); \\ \ln \biggl( 1-\frac{\vartheta^* \delta}{pq} \biggr)=&-\frac{2\vartheta_2 \delta}{pq}, \quad|\vartheta_2| < \max(p,q), \\{\bigl[p^* \bigl(1-p^* \bigr)\bigr]^{-1/2}} =&[pq]^{-1/2} \exp \biggl\{\frac{\vartheta_2 \delta}{pq} \biggr\}. \end{aligned} \end{aligned}$$
(5.2.9)

Using representations (5.2.7)–(5.2.9) and the assertion of Theorem 5.2.2 completes the proof. □

One can see from the above estimates that the bounds for ϑ in the statement of Theorem 5.2.3 can be narrowed if we consider smaller deviations $|p^* - p|$: if they, say, do not exceed the value αmin(p,q) where α<1/2.

The relations for P(S n =k) that we found are the so-called local limit theorems for the Bernoulli scheme and their refinements.

5.2.3 The Local Limit Theorem for the Polynomial Distributions

The basic asymptotic formula given in Theorem 5.2.1 admits a natural extension to the polynomial distribution \({\bf B}_{p}^{n}\), p=(p 1,…,p r ), when, in each of a sequence of independent trials, one has not two but r≥2 possible outcomes A 1,…,A r with probabilities p 1,…,p r , respectively. Let \(S_{n}^{(j)}\) be the number of occurrences of the event A j in n trials,

$$S_n =\bigl(S_n^{(1)} ,\ldots, S_n^{(r)}\bigr),\qquad k=(k_1 , \ldots,k_r), \qquad p^* = \frac{k}{n}, $$

and put H(x)=∑x i ln(x i /p i ), x=(x 1,…,x r ). Clearly, \(S_{n} \mathbin {{\subset }\hspace {-.7em}{=}}\mathbf{B}_{p}^{n}\). The following assertion is a direct extension of Theorem 5.2.1.

Theorem 5.2.4

If each of the r variables k 1,…,k r is either zero or tends to ∞ as n→∞ then

$$\mathbf{P}(S_n =k) \sim(2\pi n)^{(1-r_0 )/2} \Biggl( \,\prod_{j:\,k_j \ne0} p_j^* \Biggr)^{-1/2} \exp\bigl\{ {-}nH\bigl(p^* \bigr) \bigr\} , $$

where r 0 is the number of variables k 1,…,k r which are not equal to zero.

Proof

As in the proof of Theorem 5.2.1, we will use Stirling’s formula

$$n! \sim\sqrt{2\pi n} e^{-n}n^n $$

as n→∞. Assuming without loss of generality that all k j →∞, j=1,…,r, we get

$$\begin{aligned} \mathbf{P}(S_n =k) \sim& (2\pi)^{(1-r)/2} \biggl( \frac{n}{\prod_{j=1}^r k_j} \biggr)^{1/2} \prod_{j=1}^r \biggl( \frac{np_j }{k_j } \biggr)^{k_j} \\=& (2\pi n )^{(1-r)/2} \Biggl( \,\prod_{j=1}^r p_j^* \Biggr)^{-1/2} \exp \Biggl\{ n \sum _{j=1}^{r} \frac{k_j }{n} \ln \frac{p_j n}{k_j } \Biggr\} = (2\pi n )^{(1-r)/2} \Biggl( \,\prod_{j=1}^r p_j^* \Biggr)^{-1/2} e^{-nH(p^* )} . \end{aligned}$$

 □

5.3 The de Moivre–Laplace Theorem and Its Refinements

Let a and b be two fixed numbers and \(\zeta_{n} =(S_{n} -np)/\sqrt{npq}\). Then

$$\mathbf{P}(a<\zeta_n <b)=\sum_{a\sqrt{npq} <z <b\sqrt{npq}} \mathbf {P}(S_n -np =z). $$

If, instead of $\mathbf{P}(S_n - np = z)$, we substitute here the values $\varphi(z\varDelta)\varDelta$ (see Corollary 5.2.1), we will get an integral sum $\sum_{a<z\varDelta<b} \varphi(z\varDelta)\varDelta$ corresponding to the integral \(\int_{a}^{b} \varphi(x)\,dx\).

Thus relations (5.2.4) make the equality

$$ \lim_{n \to\infty} \mathbf{P}(a<\zeta_n <b) =\int_a^b \varphi (x)\,dx = \varPhi (b)- \varPhi(a) $$
(5.3.1)

plausible, where Φ(x) is the normal distribution function with parameters (0,1):

$$\varPhi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-{t^2}/{2}}\,dt. $$

This is the de Moivre–Laplace theorem, which is one of the so-called integral limit theorems that describe probabilities of the form P(S n <x). In Chap. 8 we will derive more general integral theorems from which (5.3.1) will follow as a special case.
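The quality of approximation (5.3.1) for a moderate n can be checked directly; a minimal sketch (Python; n, p, a and b are arbitrary illustrative choices):

```python
from math import comb, erf, sqrt

n, p = 500, 0.4
q = 1 - p

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

a, b = -1.0, 2.0
lo = n * p + a * sqrt(n * p * q)
hi = n * p + b * sqrt(n * p * q)
exact = sum(comb(n, k) * p**k * q**(n - k)
            for k in range(n + 1) if lo < k < hi)
print(exact, Phi(b) - Phi(a))   # the two values agree to about 1/sqrt(npq)
```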

Theorem 5.2.3 makes it possible to obtain (5.3.1) together with an error bound or, in other words, with a bound for the convergence rate.

Let A and B be integers,

$$ a= \frac{A -np}{\sqrt{npq}}, \qquad b= \frac{B -np}{\sqrt{npq}}. $$
(5.3.2)

Theorem 5.3.1

Let b>a, c=max(|a|,|b|), and

$$\rho= \frac{c^3 +3c}{3}\varDelta+\frac{\varDelta^2}{6}. $$

If \(\varDelta=1/\sqrt{npq} \le1/2\) and ρ≤1/2 then

$$ \mathbf{P}(A \le S_n < B ) = \mathbf{P}(a \le \zeta_n <b) = \int_a^b \varphi(t) \,dt(1+\vartheta_1 \varDelta c) (1+2\vartheta_2 \rho), $$
(5.3.3)

where |ϑ i |≤1, i=1,2.

This theorem shows that the left-hand side in (5.3.3) can be equivalent to Φ(b)−Φ(a) for growing a and b as well. In that case, Φ(b)−Φ(a) can converge to 0, and knowing the relative error in (5.3.1) is more convenient since its smallness enables one to establish that of the absolute error as well, but not vice versa.

Proof

First we note that, for all k such that \(|z| =|k-np| < c\sqrt{npq}\), the conditions of Theorem 5.2.3 will hold. Indeed, to have the inequality |p p|<(1/2)min(p,q) it suffices that |knp|<npq/2=1/(2Δ 2). This inequality will hold if c<1/(2Δ). But since ρ≤1/2, one has

$$\frac{c(c^2 +3)\varDelta}{3} <1/2 ,\qquad c\varDelta< 1/2 . $$

Thus, for each k such that \(a\sqrt{npq} \le z < b\sqrt{npq}\), we can make use of Theorem 5.2.3 to conclude that

$$ \mathbf{P}(S_n -np =z)=\varphi(z\varDelta)\varDelta \exp \biggl\{ \vartheta \biggl( \frac{|z|^3 \varDelta^4}{3}+ \biggl(|z|+ \frac{1}{6} \biggr)\varDelta^2 \biggr) \biggr\} , $$
(5.3.4)

where |ϑ|<1. Since, for ρ≤1,

$$\biggl \vert \frac{e^{\rho} -1}{\rho}\biggr \vert < e-1 <2, $$

the absolute value of the correction term in (5.3.4) does not exceed (substituting there $|z|\varDelta = c$)

$$\biggl| \exp \biggl\{ \vartheta \biggl( \frac{c^3\varDelta}{3} +c\varDelta + \frac{\varDelta^2 }{6} \biggr) \biggr\} -1 \biggr| \le2|\vartheta| \biggl( \frac{c^3\varDelta}{3} +c\varDelta+\frac{\varDelta^2 }{6} \biggr) =2|\vartheta| \rho . $$

Therefore

$$ \mathbf{P}(A \le S_n <B)=\sum _{a \le z\varDelta<b } \varphi(z\varDelta )\varDelta[1+2\vartheta_1 \rho], $$
(5.3.5)

where |ϑ 1|<1.

Now we transform the sum on the right-hand side of the last equality. To this end, note that, for any smooth function φ(x),

$$ \biggl \vert \varDelta\varphi(x) - \int_x^{x + \varDelta} \varphi(t)\,dt \biggr \vert \le \frac{\varDelta^2 }{2} \max_{x\le t \le x + \varDelta}\bigl| \varphi' (t)\bigr|. $$
(5.3.6)

But for the function \(\varphi(x) = (2\pi)^{-1/2} e^{-x^{2} /2}\) one has $\varphi'(x) = -x\varphi(x)$, and the maximum value of φ(t) on the segment [x,x+Δ], |x|≤c, differs from the minimum value by not more than the factor $\exp\{c\varDelta + \varDelta^2/2\}$. Therefore, for |x|≤c, one has by virtue of (5.3.6)

$$\biggl \vert \varDelta\varphi(x) - \int_x^{x + \varDelta} \varphi (t)\,dt \biggr \vert \le \frac{\varDelta^2 }{2}\, c\, e^{c\varDelta+\varDelta^2 /2}\, \frac{1}{\varDelta} \int_x^{x+\varDelta} \varphi(t)\,dt .$$

Since $c\varDelta + \varDelta^2/2 < 1/2 + 1/8$, \(e^{c\varDelta +{\varDelta^{2} }/{2}} \le2\), we have the representation

$$\varDelta\varphi(x) =\int_{x}^{x+\varDelta} \varphi(t) \,dt\, (1+\vartheta_1 \varDelta c),\quad |\vartheta_1 |<1. $$

Substituting this into (5.3.5) we obtain the assertion of the theorem. □

Thus by Theorem 5.3.1 the difference

$$ \bigl| \mathbf{P} ( x \le\zeta_n <y )- \bigl( \varPhi(y) - \varPhi(x) \bigr) \bigr| $$
(5.3.7)

can be effectively, yet rather roughly, bounded from above by a quantity of the order \(1/\sqrt{npq}\) if x=a, y=b (assuming that a and b are values which can be represented in the form $(k-np)\varDelta$, see (5.3.2)). If x and y do not belong to the mentioned lattice with the span Δ then the error (5.3.7) will still be of the same order since, for instance, when y varies, P(x≤ζ n <y) remains constant on the semi-intervals of the form $(a+k\varDelta, a+(k+1)\varDelta]$, while the function Φ(y)−Φ(x) increases monotonically with a bounded derivative. A similar argument holds for the left end point x. It is important to note that the error order \(1/\sqrt{npq}\) cannot be improved, for the jumps of the distribution function of ζ n are just of this order of magnitude by Theorem 5.2.2.

Theorem 5.3.1 enables one to use the normal approximation for P(xζ n <y) in the so-called large deviations range as well, when both x and y grow in absolute value and are of the same sign. In that case, both Φ(y)−Φ(x) and the probability to be approximated tend to zero. Therefore the approximation can be considered satisfactory only if

$$ \frac{\mathbf{P} ( x \le\zeta_n <y )}{ ( \varPhi(y) -\varPhi(x) )} \to1. $$
(5.3.8)

As Theorem 5.3.1 shows, this convergence will take place if

$$c=\max\bigl(|x|,|y|\bigr)=o\bigl(\varDelta^{-1/3 }\bigr) $$

or, which is the same, $c = o(n^{1/6})$. For more details about large deviation probabilities, see Chap. 9.

For larger values of c, as one could verify using Theorem 5.2.1, relation (5.3.8) will, generally speaking, not hold.

In conclusion we note that since

$$\mathbf{P}\bigl(|\zeta_n | > b\bigr) \to0 $$

as b→∞, it follows immediately from Theorem 5.3.1 that, for any fixed y,

$$\lim_{n \to\infty} \mathbf{P}(\zeta_n < y ) = \varPhi(y). $$

Later we will show that this assertion remains true under much wider assumptions, when ζ n is a scaled sum of arbitrarily distributed random variables having finite variances.

5.4 The Poisson Theorem and Its Refinements

5.4.1 Quantifying the Closeness of Poisson Distributions to Those of the Sums S n

As we saw from the bounds in the last section, the de Moivre–Laplace theorem gives a good approximation to the probabilities of interest if the number npq (the variance of S n ) is large. This number will grow together with n if p and q are fixed positive numbers. But what will happen in a problem where, say, p=0.001 and n=1000 so that np=1? Although n is large here, applying the de Moivre–Laplace theorem in such a problem would be meaningless. It turns out that in this case the distribution P(S n =k) can be well approximated by the Poisson distribution Π μ with an appropriate parameter value μ (see Sect. 5.4.2). Recall that

$$\boldsymbol{\Pi}_{\mu} (B) =\sum_{0 \le k \in B} e^{-\mu}\frac {\mu^k }{k!}. $$

Put np=μ.

Theorem 5.4.1

For all sets B,

$$\bigl|\mathbf{P}(S_n \in B) -\boldsymbol{\Pi}_{\mu} (B)\bigr| \le \frac{\mu^2}{n}. $$

We could prove this assertion in the same way as the local theorem, making use of the explicit formula for P(S n =k). However, we can prove it in a simpler and nicer way which could be called the common probability space method, or coupling method. The method is often used in research in probability theory and consists, in our case, of constructing on a common probability space random variables S n and \(S^{*}_{n}\), the latter being as close to S n as possible and distributed according to the Poisson distribution.

It is also important that the common probability space method admits, without any complications, extension to the case of non-identically distributed random variables, when the probability of getting 1 in a particular trial depends on the number of the trial. Thus we will now prove a more general assertion of which Theorem 5.4.1 is a special case.

Assume that we are given a sequence of independent random variables ξ 1,…,ξ n , such that \(\xi_{j} \mathbin {{\subset }\hspace {-.7em}{=}}\mathbf{B}_{p_{j} }\). Put, as above, \(S_{n} = \sum_{j=1}^{n} \xi_{j}\). The theorem we state below is intended for approximating the probability P(S n =k) when p j are small and the number \(\mu= \sum_{j=1}^{n} p_{j}\) is “comparable” with 1.

Theorem 5.4.2

For all sets B,

$$\bigl|\mathbf{P}(S_n \in B) -\boldsymbol{\Pi}_{\mu} (B)\bigr| \le \sum_{j=1}^{n} p_j^2 . $$

To prove this theorem we will need an important “stability” property of the Poisson distribution.

Lemma 5.4.1

If η 1 and η 2 are independent, \(\eta_{1} \mathbin {{\subset }\hspace {-.7em}{=}}\boldsymbol {\Pi}_{\mu_{1} }\) and \(\eta_{2} \mathbin {{\subset }\hspace {-.7em}{=}}\boldsymbol{\Pi}_{\mu_{2} }\), then

$$\eta_1 + \eta_2 \mathbin {{\subset }\hspace {-.7em}{=}}\boldsymbol{\Pi}_{\mu_1 + \mu_2}.$$

Proof

By the total probability formula,

$$\begin{aligned} \mathbf{P}(\eta_1 + \eta_2 =k) =&\sum _{j=0}^{k}\mathbf{P}(\eta_1 =j) \mathbf{P}( \eta_2 =k-j) \\=&\sum_{j=0}^{k} \frac{\mu_1^j e^{-\mu_1}}{j!}\cdot \frac{\mu_2^{k-j} e^{-\mu_2}}{(k-j)!}= \frac{1}{k!} e^{-(\mu_1 +\mu_2 )} \sum _{j=0}^{k} {k\choose j} \mu_1^j \mu_2^{k-j} \\=&\frac{(\mu_1 + \mu_2)^k e^{-(\mu_1 + \mu_2)}}{k!}. \end{aligned}$$

 □
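The convolution identity of Lemma 5.4.1 is also easy to verify numerically; a minimal sketch (Python; the parameters μ₁ and μ₂ are arbitrary):

```python
from math import exp, factorial

def pois(mu, k):
    """Poisson probability P(eta = k) for eta with parameter mu."""
    return exp(-mu) * mu**k / factorial(k)

mu1, mu2 = 1.3, 0.7
for k in range(5):
    conv = sum(pois(mu1, j) * pois(mu2, k - j) for j in range(k + 1))
    print(k, conv, pois(mu1 + mu2, k))   # the two columns coincide
```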

Proof of Theorem 5.4.2

Let ω 1,…,ω n be independent random variables uniformly distributed over [0,1] (each can be taken to be the identity function on the unit interval with the uniform distribution). We can assume that the vector ω=(ω 1,…,ω n ) is given as the identity function on the unit n-dimensional cube Ω with the uniform distribution.

Now construct the random variables ξ j and \(\xi_{j}^{*}\) on Ω as follows:

$$\xi_j (\omega)= \begin{cases} 0 & \mbox{if } \omega_j <1-p_j ,\\ 1 & \mbox{if } \omega_j \ge1-p_j ; \end{cases} \qquad \xi_j^* (\omega)=k \quad\mbox{if } \pi_{k-1} \le\omega_j < \pi_k ,\ k=0,1,\ldots, $$

where \(\pi_{k} =\sum_{m \le k} e^{-p_{j}} \frac{(p_{j})^{m}}{m!}\), k=0,1,…, and $\pi_{-1} = 0$.

It is evident that the ξ j (ω) are independent and \(\xi_{j} (\omega) \mathbin {{\subset }\hspace {-.7em}{=}}{\bf B}_{p_{j} }\); \(\xi_{j}^{*} (\omega)\) are also jointly independent with \(\xi_{j}^{*} (\omega) \mathbin {{\subset }\hspace {-.7em}{=}}\boldsymbol{\Pi}_{p_{j} }\). Now note that since \(1-p_{j} \le e^{-p_{j} }\) one has \(\xi_{j} (\omega) \ne\xi_{j}^{*} (\omega)\) only if \(\omega_{j} \in [ 1-p_{j},e^{-p_{j} } )\) or \(\omega_{j} \in [ e^{-p_{j} } + p_{j} e^{-p_{j} },1 ]\). Hence

$$\mathbf{P}\bigl(\xi_j \ne\xi_j^*\bigr)= \bigl(e^{-p_j } -1 +p_j \bigr)+\bigl(1 -e^{-p_j}-p_j e^{-p_j} \bigr) =p_j\bigl(1-e^{-p_j}\bigr)\le p_j^2 $$

and

$$\mathbf{P}\bigl(S_n \ne S_n^* \bigr)\le\mathbf{P} \biggl( \bigcup_j \bigl\{ \xi_j \ne \xi_j^*\bigr\} \biggr) \le\sum p_j^2 , $$

where \(S_{n}^{*} = \sum_{j=1}^{n} \xi_{j}^{*} \mathbin {{\subset }\hspace {-.7em}{=}}\boldsymbol{\Pi}_{\mu}\).

Now we can write

$$\begin{aligned} \mathbf{P}(S_n \in B) =&\mathbf{P}\bigl(S_n \in B,S_n =S_n^* \bigr)+ \mathbf{P}\bigl(S_n \in B,S_n \ne S_n^* \bigr) \\=&\mathbf{P}\bigl(S_n^* \in B \bigr)-\mathbf{P} \bigl(S_n^* \in B,S_n \ne S_n^* \bigr)+ \mathbf{P}\bigl(S_n \in B,S_n \ne S_n^* \bigr), \end{aligned}$$

so that

$$ \bigl|\mathbf{P}(S_n \in B) -\boldsymbol{\Pi}_{\mu} (B)\bigr| = \bigl|\mathbf{P}(S_n \in B) -\mathbf{P}\bigl(S_n^* \in B\bigr)\bigr| \le\mathbf{P}\bigl(S_n \ne S_n^* \bigr) \le\sum_{j=1}^{n} p_j^2 . $$
(5.4.1)

The assertion of the theorem follows from this in an obvious way. □
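The coupling just constructed is straightforward to simulate. The sketch below (Python; the probabilities p_j, the seed and the number of trials are arbitrary illustrative choices) builds ξ_j and ξ_j^* from the same uniform variables ω_j and estimates P(S_n ≠ S_n^*), which stays below Σp_j²:

```python
import random
from math import exp, factorial

random.seed(0)
ps = [0.02, 0.05, 0.01, 0.03] * 10     # small success probabilities p_j
trials, mismatches = 50_000, 0

def poisson_quantile(u, mu):
    """Smallest k with u < pi_k, pi_k being the Poisson(mu) distribution function."""
    k, cdf = 0, exp(-mu)
    while u >= cdf:
        k += 1
        cdf += exp(-mu) * mu**k / factorial(k)
    return k

for _ in range(trials):
    s = s_star = 0
    for p in ps:
        w = random.random()                 # the common omega_j
        s += 1 if w >= 1 - p else 0         # xi_j,  Bernoulli with parameter p
        s_star += poisson_quantile(w, p)    # xi_j*, Poisson with parameter p
    mismatches += (s != s_star)

print(mismatches / trials, sum(p * p for p in ps))  # observed frequency vs. the bound
```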

Remark 5.4.1

One can give other common probability space constructions as well. One of them will be used now to show that there exists a better Poisson approximation to the distribution of S n .

Namely, let \(\xi_{j}^{*} (\omega)\) be independent random variables distributed according to the Poisson laws with parameters r j =−ln(1−p j )≥p j , so that \(\mathbf{P}(\xi_{j}^{*} =0)= e^{-r_{j} } =1-p_{j}\). Then \(\xi_{j} (\omega) = \min\{ 1,\xi_{j}^{*} (\omega)\} \mathbin {{\subset }\hspace {-.7em}{=}}{\bf B}_{p_{j} }\) and, moreover,

$$\mathbf{P} \Biggl(\, \bigcup_{j=1}^{n} \bigl \{\xi_j (\omega) \ne\xi_j^* (\omega )\bigr\} \Biggr) \le \sum_{j=1}^{n} \mathbf{P} \bigl( \xi_j^* (\omega) \ge2 \bigr) =\sum_{j=1}^{n} \bigl( 1-e^{-r_j }-r_j e^{-r_j } \bigr) . $$

But for r=−ln(1−p) one has the inequality

$$\begin{aligned} 1-e^{-r} -re^{-r} =&p +(1-p)\ln(1-p) \le p +(1-p) \biggl( -p- \frac{p^2}{2} \biggr)\\=&\frac{p^2}{2} (1+p). \end{aligned}$$

Hence for the new Poisson approximation we have

$$\mathbf{P}\bigl(S_n^* \ne S_n \bigr) \le \frac{1}{2} \sum_{j=1}^{n} p_j^2 (1+p_j ). $$

Putting \(\lambda=-\sum_{j=1}^{n} \ln(1-p_{j} ) \ge\sum_{j=1}^{n} p_{j}\), the same argument as above will lead to the bound

$$\sup_{B} \bigl|\mathbf{P}(S_n \in B) - \boldsymbol{\Pi}_{\lambda} (B) \bigr| \le\frac{1}{2} \sum_{j=1}^{n} p_j^2 (1+p_j ). $$

This bound of the rate of approximation given by the Poisson distribution with a “slightly shifted” parameter is better than that obtained in Theorem 5.4.2. Moreover, one could note that, in the new construction, \(\xi_{j} \le\xi_{j}^{*}\), \(S_{n} \le S_{n}^{*} \), and consequently

$$\mathbf{P}(S_n \ge k) \le\mathbf{P}\bigl(S_n^* \ge k \bigr) =\boldsymbol{\Pi }_{\lambda}\bigl([k,\infty)\bigr). $$
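For identically distributed ξ_j both approximations are easy to compare numerically. The sketch below (Python; n and p are arbitrary illustrative choices) computes the total variation distance $\sup_B |\mathbf{P}(S_n\in B) - \boldsymbol{\Pi}(B)|$ for the parameter μ = np and for the shifted parameter λ = −n ln(1−p):

```python
from math import comb, exp, log

n, p = 100, 0.05
mu = n * p                    # the classical parameter choice
lam = -n * log(1 - p)         # the "slightly shifted" parameter

def tv(par):
    """Total variation distance between the binomial law and Poisson(par)."""
    d, pois = 0.0, exp(-par)  # Poisson probability at k = 0, updated recursively
    for k in range(3 * n):
        binom = comb(n, k) * p**k * (1 - p)**(n - k) if k <= n else 0.0
        d += abs(binom - pois)
        pois *= par / (k + 1)
    return d / 2

print(tv(mu), n * p * p)                   # distance vs. the bound of Theorem 5.4.2
print(tv(lam), 0.5 * n * p * p * (1 + p))  # distance vs. the improved bound
```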

5.4.2 The Triangular Array Scheme. The Poisson Theorem

Now we return to the case of identically distributed ξ k . To obtain from Theorem 5.4.2 a limit theorem similar to the de Moivre–Laplace theorem (see (5.3.1)), one needs a somewhat different setup. Indeed, to ensure that np remains bounded as n increases, p=P(ξ k =1) must converge to zero, which cannot be the case when we consider a fixed sequence of random variables ξ 1,ξ 2,… .

We introduce a sequence of rows (of growing length) of random variables:

$$\begin{array}{c@{\quad}c@{\quad}c@{\quad}c@{\quad}c} \xi_1^{(1)};&&&\\\xi_1^{(2)},&\xi_2^{(2)};&&\\\xi_1^{(3)},&\xi_2^{(3)},&\xi_3^{(3)};&\\\ldots& \ldots&\ldots&\ldots& \ldots\\\xi_1^{(n)},&\xi_2^{(n)},&\xi_3^{(n)},&\ldots,&\xi_n^{(n)}. \end{array} $$

This is the so-called triangular array scheme. The superscript denotes the row number, while the subscript denotes the number of the variable in the row.

Assume that the variables \(\xi_{k}^{(n)}\) in the n-th row are independent and \(\xi_{k}^{(n)} \mathbin {{\subset }\hspace {-.7em}{=}}{\bf B}_{p_{n} }\), k=1,…,n.

Corollary 5.4.1

(The Poisson theorem)

If np n μ>0 as n→∞ then, for each fixed k,

$$ \mathbf{P}(S_n =k) \to\boldsymbol{\Pi}_{\mu}\bigl(\{ k\}\bigr), $$
(5.4.2)

where \(S_{n} =\xi_{1}^{(n)} + \cdots+ \xi_{n}^{(n)}\).

Proof

This assertion is an immediate corollary of Theorem 5.4.1. It can also be obtained directly, by noting that it follows from the equality

$$\mathbf{P}(S_n =k)=\binom{n}{ k} p^k (1-p)^{n-k} $$

that

$$\begin{aligned} \mathbf{P}(S_n =0) = e^{n\ln(1-p)} \sim e^{- \mu},\qquad \frac{\mathbf{P}(S_n =k+1)}{\mathbf{P}(S_n =k)}=\frac{n-k}{k+1} \frac{p}{1-p} \sim\frac{\mu}{k+1}. \end{aligned}$$

 □

Theorem 5.4.2 implies an analogue of the Poisson theorem in a more general case as well, when the \(\xi_{j}^{(n)}\) are not necessarily identically distributed and can take values different from 0 and 1.

Corollary 5.4.2

Assume that \(p_{jn}=\mathbf{P}(\xi_{j}^{(n)} =1)\) depend on n and j so that

$$\max_{j} p_{jn} \to0,\qquad \sum _{j=1}^{n} p_{jn} \to\mu>0, \qquad \mathbf{P}\bigl(\xi_j^{(n)} =0\bigr) =1-p_{jn} +o(p_{jn}). $$

Then (5.4.2) holds.

Proof

To prove the corollary, one has to use Theorem 5.4.2 and the fact that

$$\mathbf{P} \Biggl(\, \bigcup_{j=1}^{n} \bigl \{ \xi_j^{(n)}\ne0, \xi _j^{(n)} \ne1 \bigr\} \Biggr) \le\sum_{j=1}^{n} o(p_{jn} ) =o(1), $$

which means that, with probability tending to 1, all the variables \(\xi_{j}^{(n)}\) assume the values 0 and 1 only. □

One can clearly obtain from Theorems 5.4.1 and 5.4.2 somewhat stronger assertions than the above. In particular,

$$\sup_B \bigl| \mathbf{P}(S_n \in B) - \boldsymbol{\Pi}_{\mu} (B) \bigr|\to0\quad\mbox{as } n\to\infty. $$

Note that under the assumptions of Theorem 5.4.1 this convergence will also take place in the case where np→∞ but only if np 2→0. At the same time, the refinement of the de Moivre–Laplace theorem from Sect. 5.3 shows that the normal approximation for the distribution of S n holds if np→∞ (for simplicity we assume that pq so that \(npq \geq \frac{1}{2} np \to\infty\)).

Thus there exist sequences p∈{p: np→∞, np 2→0} such that both the normal and the Poisson approximations are valid. In other words, the domains of applicability of the normal and Poisson approximations overlap.

We see further from Theorem 5.4.1 that the convergence rate in Corollary 5.4.1 is determined by a quantity of the order of $n^{-1}$. Since, as n→∞,

$$\mathbf{P}(S_n =0)-\boldsymbol{\Pi}_{\mu}\bigl(\{ 0\} \bigr) =e^{n \ln(1-p)} -e^{-\mu} \sim-\frac{\mu^2}{2n} e^{-\mu}, $$

this estimate cannot be substantially improved. However, for large k (in the large deviations range, say) such an estimate for the difference

$$\mathbf{P}(S_n =k)- \boldsymbol{\Pi}_{\mu} \bigl( \{ k \} \bigr) $$

becomes rough. (This is because, in (5.4.1), we neglected not only the different signs of the correction terms but also the rare events {S n =k} and \(\{ S_{n}^{*} =k \} \) that appear in the arguments of the probabilities.) Hence we see, as in Sect. 5.3, the necessity of having approximations for which both the absolute and relative errors are small.

Now we will show that the asymptotic equivalence relations

$$\mathbf{P}(S_n =k) \sim\boldsymbol{\Pi}_{\mu}\bigl(\{ k\} \bigr) $$

remain valid when k and μ grow (along with n) in such a way that

$$k=o\bigl(n^{2/3}\bigr),\qquad\mu=o\bigl(n^{2/3}\bigr) ,\qquad|k- \mu|=o(\sqrt{n} \,). $$

Proof

Indeed,

$$\begin{aligned} \mathbf{P}(S_n =k) =&\binom{n}{k}p^k (1-p)^{n-k} =\frac{n(n-1) \cdots (n-k+1)}{k!} p^k (1-p)^{n-k} \\=&\frac{(np)^k}{k!}e^{-pn} \biggl( 1-\frac{1}{n} \biggr) \cdots \biggl( 1-\frac{k-1}{n} \biggr) (1-p)^{n-k} e^{pn}\\=& \boldsymbol{\Pi}_{\mu} \bigl(\{k \}\bigr) e^{\varepsilon(k,n)}. \end{aligned}$$

Thus we have to prove that, for values of k and μ from the indicated range,

$$ \varepsilon(k,n):=\ln \biggl[ \biggl( 1-\frac{1}{n} \biggr) \cdots \biggl( 1-\frac{k-1}{n} \biggr) (1-p)^{n-k} e^{pn} \biggr]=o(1). $$
(5.4.3)

We will obtain this relation together with the form of the correction term. Namely, we will show that

$$ \varepsilon(k,n)= \frac{k-(k-\mu)^2}{2n} +O \biggl( \frac{k^3 + \mu^3}{n^2} \biggr) , $$
(5.4.4)

and hence

$$\mathbf{P}(S_n =k) = \biggl(1+ \frac{k-(k-\mu)^2}{2n} +O \biggl( \frac{k^3 + \mu^3}{n^2} \biggr) \biggr) \boldsymbol{\Pi}_{\mu} \bigl(\{k \} \bigr). $$

We make use of the fact that, as α→0,

$$\ln(1-\alpha)=-\alpha-\frac{\alpha^2 }{2} +O\bigl(\alpha^3 \bigr). $$

Then relations (5.4.3) and (5.4.4) will follow from the equalities

$$\begin{aligned} \sum_{j=1}^{k-1} \ln \biggl(1- \frac{j}{n} \biggr) =&-\sum_{j=1}^{k-1} \frac{j}{n}+ O \biggl( \frac{k^3}{n^2} \biggr) = -\frac{k(k-1)}{2n} +O \biggl( \frac{k^3}{n^2} \biggr), \\(n-k)\ln(1-p) +pn =& (n-k) \biggl( -p -\frac{p^2}{2} +O \bigl(p^3\bigr) \biggr) +pn \\=& -\frac{\mu^2 }{2n} +\frac{k\mu}{n} + O \biggl( \frac{\mu^3 }{n^2} \biggr). \end{aligned}$$

 □

In conclusion we note that the approximate Poisson formula

$$\mathbf{P}(S_n =k) \approx\frac{\mu^k }{k!} e^{-\mu} $$

is widely used in various applications and has, as experience and the above estimates show, a rather high accuracy even for moderate values of n.

Now we consider several examples of the use of the de Moivre–Laplace and Poisson theorems for approximate computations.

Example 5.4.1

Suppose we are given $10^4$ packets of grain. It is known that there are 5000 tagged grains in the packets. What is the probability that, in a particular fixed packet, there is at least one tagged grain? We can assume that the tagged grains are distributed to packets at random. Then the probability that a particular tagged grain will be in the chosen packet is $p=10^{-4}$. Since there are 5000 such grains, this will be the number of trials, i.e. n=5000. Define a random variable ξ k as follows: $\xi_k = 1$ if the k-th grain is in the chosen packet, and $\xi_k = 0$ otherwise. Then

$$S_{5000} =\sum_{k=1}^{5000} \xi_k $$

will be the number of tagged grains in our packet. By Theorem 5.4.1, $\mathbf{P}(S_{5000}=0)\approx e^{-np}=e^{-0.5}$, so that the desired probability is approximately equal to $1-e^{-0.5}$. The accuracy of this relation turns out to be rather high (by Theorem 5.4.1, the error does not exceed $\mu^2/n = 2^{-1}\times10^{-4}$). If we used the Poisson theorem instead of Theorem 5.4.1, we would have to imagine a triangular array of Bernoulli random variables, our ξ k constituting the 5000-th row of the array. Moreover, we would assume that, for the n-th row, one has np n =0.5. Thus the conditions of the Poisson theorem would be met and we could make use of the limit theorem to find the approximate equality we have already obtained.
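The exact binomial answer is easy to compute for comparison (a Python sketch):

```python
from math import exp

n, p = 5000, 1e-4
exact = 1 - (1 - p)**n       # P(at least one tagged grain in the packet)
poisson = 1 - exp(-n * p)    # the Poisson approximation with mu = 0.5
print(exact, poisson, abs(exact - poisson))  # the difference is below mu^2/n = 5e-5
```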

Example 5.4.2

A similar argument can be used in the following problem. There are n dangerous bacteria in a reservoir of capacity V from which we take a sample of volume $v \ll V$. What is the probability that we will find the bacteria in the test sample?

One usually assumes that the probability p that any given bacterium will be in the test sample is equal to the ratio v/V. Moreover, it is also assumed that the presence of a given bacterium in the sample does not depend on whether the remaining n−1 bacteria are in the test sample or not. In other words, one usually postulates that the mechanism of bacterial transfer into the test sample is equivalent to a sequence of n independent trials with “success” probability equal to p=v/V in each trial.

Introducing random variables ξ k as above, we obtain a description of the number of bacteria in the test sample by the sum \(S_{n} = \sum_{k=1}^{n} \xi_{k}\) in the Bernoulli scheme. If nv is comparable in magnitude with V then by the Poisson theorem the desired probability will be equal to

$$\mathbf{P}(S_n >0) \approx1-e^{-nv/V}. $$

Similar models are also used to describe the number of visible stars in a certain part of the sky far away from the Milky Way. Namely, it is assumed that if there are n visible stars in a region R then the probability that there are k visible stars in a subregion rR is

$$\binom{n}{k} p^k (1-p)^{n-k} , $$

where p is equal to the ratio S(r)/S(R) of the areas of the regions r and R respectively.

Example 5.4.3

Suppose that the probability that a newborn baby is a boy is constant and equals 0.512 (see Sect. 3.4.1).

Consider a group of $10^4$ newborn babies and assume that it corresponds to a series of $10^4$ independent trials whose outcomes are the events that either a boy or a girl is born. What is the probability that the number of boys among these newborn babies will exceed the number of girls by at least 200?

Define random variables as follows: $\xi_k = 1$ if the k-th baby is a boy and $\xi_k = 0$ otherwise. Then \(S_{n} = \sum_{k=1}^{10^{4}} \xi_{k}\) is the number of boys in the group. The quantity $npq \sim 2.5\times10^3$ is rather large here, hence applying the integral limit (de Moivre–Laplace) theorem we obtain for the desired probability the value

$$\begin{aligned} \mathbf{P}(S_n \ge5100) =&1-\mathbf{P} \biggl( \frac{S_n -np}{\sqrt {npq}} < \frac{5100-5120}{\sqrt{2500}} \biggr) \\\approx &1-\varPhi(-20/50) =1-\varPhi(-0.4) \approx0.66 . \end{aligned}$$

To find the numerical values of Φ(x) one usually makes use of suitable statistical computer packages or calculators.

In our example, \(\varDelta=1/\sqrt{npq}\approx1/50\), and a satisfactory approximation by the de Moivre–Laplace formula will certainly be ensured (see Theorem 5.3.1) for c≤2.5.

If, however, we have to estimate the probability that the proportion of boys exceeds 0.55, we will be dealing with large deviation probabilities when to estimate P(S n >5500) one would rather use the approximate relation obtained in Sect. 1.3 by virtue of which (k=0.45n, q=0.488) one has

$$\mathbf{P}(S_n > 5500 ) \approx\frac{(n+1-k)q}{(n+1)q-k} \mathbf {P}(S_n = 5500 ). $$

Applying Theorem 5.2.1 we find that

$$\mathbf{P}(S_n > 5500 ) \approx\frac{0.55q}{q-0.45} \frac{1}{\sqrt {2\pi n0.25}} e^{-nH(0.55)} \le\frac{1}{5} e^{-25} < 10^{-11}. $$

Thus if we assume for a moment that 100 million babies are born on this planet each year and group them into batches of 10 thousand, then, to observe a group in which the proportion of boys exceeds the mean value by just 3.8 % we will have to wait, on average, 10 million years (see Example 4.1.1 in Sect. 4.1).

It is clear that the normal approximation can be used for numerical evaluation of probabilities for the problems from Example 5.4.3 provided that the values of np are large.

5.5 Inequalities for Large Deviation Probabilities in the Bernoulli Scheme

In conclusion of the present chapter we will derive several useful inequalities for the Bernoulli scheme. In Sect. 5.2 we introduced the function

$$H(x) =x \ln\frac{x}{p} +(1-x) \ln\frac{1-x}{1-p}, $$

which plays an important role in Theorems 5.2.1 and 5.2.2 on the asymptotic behaviour of the probability P(S n =k). We also considered there the basic properties of this function.

Theorem 5.5.1

For z≥0,

$$ \begin{aligned}[c] \mathbf{P}(S_n -np \ge z) &\le \exp \bigl\{ {-}nH ( p+z/{n} ) \bigr\} ,\\\mathbf{P}(S_n -np \le-z) &\le \exp \bigl\{ {-}nH ( p-{z}/{n} ) \bigr\}. \end{aligned} $$
(5.5.1)

Moreover, for all p,

$$ H(p+x) \ge2x^2, $$
(5.5.2)

so that each of the probabilities in (5.5.1) does not exceed exp{−2z 2/n} for any p.

To compare it with assertion (5.2.2) of Theorem 5.2.1, the first inequality from Theorem 5.5.1 can be re-written in the form

$$\mathbf{P} \biggl( \frac{S_n}{n} \ge p^* \biggr) \le\exp\bigl\{ {-}nH\bigl(p^* \bigr) \bigr\}. $$

The inequalities (5.5.1) are close, to some extent, to the de Moivre–Laplace theorem since, for z=o(n 2/3),

$$-nH \biggl( p+\frac{z}{n} \biggr) =-\frac{z^2 }{2npq} +o(1). $$

The last assertion, together with (5.5.2), can be interpreted as follows: the bound for the probability of deviating by z or more from the mean value np takes its largest value when p=1/2.

If \(z/\sqrt{n} \to\infty\), then both probabilities in (5.5.1) converge to zero as n→∞ for they correspond to large deviations of the sum S n from the mean np. As we have already said, they are called large deviation probabilities.

Proof of Theorem 5.5.1

In Corollary 4.7.2 of the previous chapter we established the inequality

$$\mathbf{P}(\xi\ge x) \le e^{-\lambda x} \mathbf{E}e^{\lambda\xi}. $$

Applying it to the sum S n we get

$$\mathbf{P}(S_n \ge np +z) \le e^{-\lambda(np+z)} \mathbf {E}e^{\lambda S_n }. $$

Since the random variables \(e^{\lambda\xi_{k}}\) are independent, \(\mathbf{E}e^{\lambda S_{n} } = \prod_{k=1}^{n} \mathbf {E}e^{\lambda\xi_{k} }\), and hence, putting $\alpha = z/n$,

$$\mathbf{P}(S_n \ge np +z) \le \bigl[ e^{-\lambda(p+\alpha)} \mathbf{E}e^{\lambda\xi_k } \bigr]^n = \bigl[ \mathbf{E}e^{\lambda[\xi_k -(p+\alpha)]} \bigr]^n . $$

The expression in brackets is equal to

$$\mathbf{E}e^{\lambda[\xi_k - (p+\alpha)]}= pe^{\lambda(1-p-\alpha)} + (1-p)e^{-\lambda(p+\alpha)}. $$

Therefore, being the sum of two convex functions, it is a convex function of λ. The equation for the minimum point λ(α) of the function has the form

$$-(p+\alpha) \bigl(1+p\bigl(e^{\lambda} -1\bigr)\bigr) +pe^{\lambda} =0, $$

from which we find that

$$e^{\lambda} = \frac{q(p+\alpha)}{p(q-\alpha)} . $$

Substituting this value of λ into the bound yields, after simple transformations, \(\mathbf{P}(S_{n} \ge np+z) \le e^{-nH(p+\alpha)}\). The first of the inequalities (5.5.1) is proved. The second inequality follows from the first if we apply the latter to the number of zeros $n - S_n$, i.e. with the roles of p and q interchanged.

It follows further from (5.2.1) and (5.2.3) that H(p)=H′(p)=0 and $H''(x)=\frac{1}{x(1-x)}$. Since the function x(1−x) attains its maximum value on the interval [0,1] at the point x=1/2, one has H″(x)≥4 and hence

$$\begin{aligned} H(p+\alpha) \ge\frac{\alpha^2 }{2} \cdot4 =2\alpha^2 . \end{aligned}$$

 □
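A quick numerical check of the bounds (5.5.1) and of the universal bound exp{−2z²/n} (a Python sketch; n, p and the deviations z are arbitrary illustrative choices):

```python
from math import comb, exp, log

n, p = 200, 0.3

def H(x):
    """The function H from (5.2.1)."""
    return x * log(x / p) + (1 - x) * log((1 - x) / (1 - p))

for z in (10, 20, 30):
    exact = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                for k in range(n + 1) if k - n * p >= z)
    print(f"z={z}: exact={exact:.3e}  "
          f"exp(-nH)={exp(-n * H(p + z / n)):.3e}  "
          f"exp(-2z^2/n)={exp(-2 * z * z / n):.3e}")
```

Each exact tail probability stays below both bounds, with the entropy bound exp{−nH} the sharper of the two.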

For analogues of Theorem 5.5.1 for sums of arbitrary random variables, see Chap. 9 and Appendix 8. Example 9.1.2 shows that the function H(α) is the so-called deviation function for the Bernoulli scheme. This function is important in describing large deviation probabilities.