This chapter includes two applications of the large deviation theory presented in Chapter 21. The first concerns a problem in cryptography in which, among other motivations, hackers attempt to break a password by guessing. The second concerns the efficiency of large sample statistical tests of hypothesis.

Example 1 (Encrypted Security Systems)

The problem to be considered here is of interest to cryptographers analyzing, for example, attempts by a hacker to enter a password-protected system by robotically guessing the password. The problem can be abstractly stated as follows: For a given finite set S = {1, 2, …, k}, say, Alice randomly generates a cipher \(X^{(n)} = x \in S^n\) of length n, where \(X^{(n)} = (X_1, \dots, X_n) \in S^n\) has a joint probability mass function \(p_{X^{(n)}}(x_1,\dots ,x_n), x_j\in S, 1\le j\le n\). According to some guessing strategy, Bob systematically steps through the messages \(y \in S^n\) in some specified order, and Alice responds to each query “\(X^{(n)} = y\)?” with “yes” or “no,” according to whether y = x or not. The goal is to quantify the effort required by guessing. Throughout it will be assumed without further mention that the message source \(X_1, X_2, \dots\) is a stationary process.

Mathematically, a guessing strategy is given by a bijection \(G : S^n \to \{1, 2, \dots, |S|^n\}\) prescribing the order in which the guesses \(y \in S^n\) are made; G(x) is then the number of guesses needed to reach the given cipher x.

To minimize the expected number of guesses, an optimal guessing function \(G^*\) therefore makes its selections in order of decreasing probabilities \(p_{X^{(n)}}(y)\), \(y \in S^n\). Note that if \(p_{X^{(n)}}(y) = p_{X^{(n)}}(z)\), then the order in which y and z are guessed does not affect the expected number of guesses to unlock the password. In particular, an optimal \(G^*\) is not unique with regard to minimizing the expected number of guesses.
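As an illustration, the following is a minimal sketch of how an optimal guessing function may be constructed for a small i.i.d. source by sorting messages in decreasing order of probability. The alphabet, probability mass function, message length, and all names below are illustrative assumptions, not data from the text.

```python
# A minimal sketch: build an optimal guessing function G* for a small i.i.d. source
# by ordering messages by decreasing probability, then compute E[G*(X^(n))^rho].
# The pmf, alphabet, and names here are illustrative assumptions.
import itertools
import numpy as np

p = np.array([0.5, 0.3, 0.2])    # hypothetical pmf on S = {0, 1, 2}
n = 4                            # message length

# Enumerate all messages in S^n with their probabilities.
messages = list(itertools.product(range(len(p)), repeat=n))
probs = np.array([np.prod(p[list(x)]) for x in messages])

# An optimal guessing function queries messages in decreasing order of probability;
# ties may be broken arbitrarily without changing the expected number of guesses.
order = np.argsort(-probs)
G_star = np.empty(len(messages), dtype=int)
G_star[order] = np.arange(1, len(messages) + 1)   # G*(x) = rank of x

rho = 1.0
print("E[G*] =", np.sum(probs * G_star))
print("E[G*^rho] =", np.sum(probs * G_star.astype(float) ** rho))
```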

As a measure of the attacker’s effort, cryptologists use an optimal guessing function \(G^*\) to define the optimal guessing exponent by

$$\displaystyle \begin{aligned} g(\rho) = \lim_{n\to\infty}{1\over n}\ln{\mathbb E}G^*(X^{(n)})^\rho, \end{aligned} $$
(22.1)

when the limit exists. The primary focus of this chapter is on the computation of g(ρ) in some generality via large deviation theory. This is achieved by systematically establishing a succession of equivalent computations: Proposition 22.2 recasts the problem in terms of an equivalent computation for word lengths, Proposition 22.3 recasts this in terms of a Rényi entropy computation, and finally Theorem 22.4 and Corollary 22.5 recast it in terms of a large deviation computation for the so-called information spectrum.

Remark 22.1

Calculations have been made for g(ρ) in the case of i.i.d. encodings X 1, …, X n by Arikan (1996), and irreducible Markov chain encodings by Malone and Sullivan (2004). These will appear as applications of the large deviation results of Hanawal and Sundaresan (2011) at the end of this example.

It will be helpful to introduce the guessing length function \(L_G : S^n \to {\mathbb N}\) associated with G, defined by

$$\displaystyle \begin{aligned} L_G({\mathbf x}) = \lceil-\ln {1\over CG({\mathbf x})} \rceil, \quad {\mathbf x}\in S^n, \end{aligned} $$
(22.2)

where ⌈⋅⌉ is the ceiling function, i.e., ⌈x⌉ is the smallest integer not smaller than x, and \(C =\sum _{{\mathbf x}\in S^n}{1\over G({\mathbf x})}\) is a normalization constant. In particular,

$$\displaystyle \begin{aligned} Q_G(x) = {1\over CG(x)}, \quad x\in S^n, \end{aligned} $$
(22.3)

defines a probability mass function on S n. Note that since C ≥ 1,

$$\displaystyle \begin{aligned} G(x) = {1\over CQ_G(x)} \le {1\over Q_G(x)}. \end{aligned} $$
(22.4)

Clearly, \(\ln G(x)\le L_G(x), x\in S^n\) by definition, and

$$\displaystyle \begin{aligned} \ln G(x) = -\ln Q_G(x) -\ln C \ge \lceil -\ln Q_G(x)\rceil-1-\ln C, \end{aligned} $$
(22.5)

so that, in summary,

$$\displaystyle \begin{aligned} L_G(x)-1-\ln C\le \ln G(x)\le L_G(x). \end{aligned} $$
(22.6)

To denote the dependence of G, L, and C on the message length n, we write \(G_n, L_n, C_n\), respectively, when necessary. Note that \(L_n\) satisfies the so-called Kraft inequality (Exercise 4)

$$\displaystyle \begin{aligned} \sum_{x\in S^n}\exp\{-L_n(x)\}\le 1. \end{aligned} $$
(22.7)

In general, any function \(L:S^n\to {\mathbb N}\) satisfying the Kraft inequality will be referred to as a length function. We let \({\mathcal L}_n\) denote the set of all such functions on S^n. \(L^*\) will denote a length function that minimizes \({\mathbb E}\exp \{\rho L(X^{(n)})\}\).
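The relations (22.2)–(22.7) are easy to check numerically; the following sketch, under an illustrative choice of source, verifies that \(Q_G\) is a probability mass function, that \(L_G\) satisfies the Kraft inequality, and that the sandwich bound (22.6) holds.

```python
# Numerical check of (22.2)-(22.7) for an illustrative i.i.d. source on S = {0, 1}.
import itertools
import math
import numpy as np

p = np.array([0.6, 0.4])     # hypothetical pmf
n = 3
messages = list(itertools.product(range(len(p)), repeat=n))
probs = np.array([np.prod(p[list(x)]) for x in messages])

order = np.argsort(-probs)                   # guess in decreasing order of probability
G = np.empty(len(messages), dtype=int)
G[order] = np.arange(1, len(messages) + 1)

C = np.sum(1.0 / G)                          # normalization constant
Q_G = 1.0 / (C * G)                          # probability mass function (22.3)
L_G = np.ceil(np.log(C * G)).astype(int)     # guessing length function (22.2)

assert abs(Q_G.sum() - 1.0) < 1e-12          # Q_G is a pmf
assert np.exp(-L_G).sum() <= 1.0 + 1e-12     # Kraft inequality (22.7)
assert np.all(np.log(G) <= L_G + 1e-12)      # upper bound in (22.6)
assert np.all(L_G - 1 - math.log(C) <= np.log(G) + 1e-12)   # lower bound in (22.6)
print("C =", C)
```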

Suppose that \(X_1, X_2, \dots\) is a stationary process and let \(Q\in {\mathcal P}_n\) denote the (common) distribution of \((X_{1+m}, \dots, X_{n+m})\), m = 0, 1, 2, …. The Shannon entropy, expressed in nats, i.e., using natural logarithms, is defined by

$$\displaystyle \begin{aligned}H(X_1,\dots,X_n) \equiv H(Q) = -\sum_{x\in S^n}Q(\{x\})\ln Q(\{x\}).\end{aligned}$$

Shannon’s entropy of the stationary process is defined by

$$\displaystyle \begin{aligned}H=\lim_{n\to\infty}{H(X_1,X_2,\dots,X_n)\over n},\end{aligned}$$

for which existence is a direct consequence of subadditivity using Fekete’s lemma from Chapter 5. Specifically, letting Q n denote the distribution of (X 1, X 2, …, X n), one has

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} & &\displaystyle H(Q_{n+m})\equiv H(X_1,\dots,X_{n+m})\\ & &\displaystyle \quad \le H(X_1,\dots,X_n) + H(X_{n+1},\dots,X_{n+m})\\ & &\displaystyle \quad = H(X_1,\dots,X_n) + H(X_1,\dots,X_m) = H(Q_n)+H(Q_m), \end{array} \end{aligned} $$
(22.8)

where the essential second line is left as Exercise 2.

Remark 22.2

Note that there is a length function L for which the (approximate) expected length is minimal, i.e., the problem

$$\displaystyle \begin{aligned}\min_{L\in{\mathcal L}_n:\sum_xe^{-L(x)}\le 1} \sum_{x\in S^n}p_{X^{(n)}}(x)L(x)\end{aligned}$$

can be shown to have a solution by the method of Lagrange multipliers (Exercise 5), provided one permits non-integer values of L.

Theorem 22.1 (Shannon)

For a length function L one has

$$\displaystyle \begin{aligned}{H(X_1,\dots,X_n)}\le { {\mathbb E}L(X_1,\dots,X_n)},\end{aligned}$$

with equality if and only if \(p_{X^{(n)}}(x) = e^{-L(x)}\). Moreover, letting \(L^*(X_1,\dots,X_n)\) denote the lengths having the smallest possible expected value for the word \((X_1, \dots, X_n)\), one has

$$\displaystyle \begin{aligned}{H(X_1,\dots,X_n)}\le { {\mathbb E}L^*(X_1,\dots,X_n)} \le{H(X_1,\dots,X_n)}+{1}.\end{aligned}$$

In particular,

$$\displaystyle \begin{aligned}{H(X_1,\dots,X_n)\over n}\le { {\mathbb E}L^*(X_1,\dots,X_n)\over n} \le {H(X_1,\dots,X_n)\over n}+{1\over n},\end{aligned}$$
$$\displaystyle \begin{aligned}\lim_{n\to\infty}{\mathbb E}{L^*(X_1,\dots,X_n)\over n} = H.\end{aligned}$$
Proof

To prove the lower bound let \(q(x) = {e^{-L(x)}\over \sum _{y\in S^n}e^{-L(y)}}\), and \(K = \sum _{y\in S^n}e^{-L(y)}\le 1\), by the Kraft inequality. Then,

$$\displaystyle \begin{aligned} \begin{array}{rcl} & &\displaystyle {\mathbb E}L(X_1,\dots,X_n) - H(X_1,\dots,X_n)\\ & &\displaystyle \quad = \sum_{x\in S^n}p_{X^{(n)}}(x)L(x) - \sum_{x\in S^n}p_{X^{(n)}}(x)\ln{1\over p_{X^{(n)}}(x)}\\ & &\displaystyle \quad = -\sum_{x\in S^n}p_{X^{(n)}}(x)\ln e^{-L(x)} + \sum_{x\in S^n}p_{X^{(n)}}(x)\ln p_{X^{(n)}}(x)\\ & &\displaystyle \quad = \sum_{x\in S^n}p_{X^{(n)}}(x)\ln{p_{X^{(n)}}(x)\over q(x)} - \ln K\\ & &\displaystyle \quad = D(p_{X^{(n)}}||q) + \ln{1\over K} \ge 0. \end{array} \end{aligned} $$
(22.9)

Note that if one could take \(L(x) = \ln {1\over p_{X^{(n)}}(x)}\), then \({\mathbb E}L(X^{(n)}) = H(X_1,\dots,X_n)\). However, such a choice for L is not integer-valued in general. Taking \(L(x) = \lceil \ln {1\over p_{X^{(n)}}(x)}\rceil \) instead, the Kraft inequality is preserved by this choice. Now, for this choice of lengths, a simple calculation yields

$$\displaystyle \begin{aligned}H(X_1,\dots,X_n)\le {\mathbb E}L(X_1,\dots,X_n) \le H(X_1,\dots,X_n) +1.\end{aligned}$$

Since \({\mathbb E}L^*(X_1,\dots ,X_n)\le {\mathbb E}L(X_1,\dots ,X_n)\) both the lower and upper bounds are satisfied by \({\mathbb E}L^*(X_1,\dots , X_n)\).\(\hfill \blacksquare \)
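Numerically, the bounds of Theorem 22.1 for the choice \(L(x)=\lceil\ln (1/p_{X^{(n)}}(x))\rceil\) can be illustrated as follows; the probability mass function is an arbitrary illustrative choice.

```python
# Illustration of H <= E[L] <= H + 1 for L(x) = ceil(ln(1/p(x))), with an
# illustrative pmf; the Kraft inequality is also checked.
import numpy as np

q = np.array([0.4, 0.25, 0.2, 0.1, 0.05])    # hypothetical pmf
H = -np.sum(q * np.log(q))                   # Shannon entropy in nats
L = np.ceil(np.log(1.0 / q))                 # Shannon code lengths
EL = np.sum(q * L)

assert np.exp(-L).sum() <= 1.0 + 1e-12       # Kraft inequality
assert H - 1e-12 <= EL <= H + 1 + 1e-12
print(f"H = {H:.4f} nats, E[L] = {EL:.4f} nats")
```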

Lemma 1

Let G be a guessing function and L G its associated length function. Then,

$$\displaystyle \begin{aligned} \left|{1\over\rho}\ln {\mathbb E}G^*(X^{(n)})^\rho-{1\over\rho} \ln {\mathbb E}\exp\left\{\rho L^*(X^{(n)})\right\}\right|\le 1+\ln C, \end{aligned} $$
(22.10)

where \(C =\sum _{{\mathbf x}\in S^n}{1\over G({\mathbf x})}\).

Proof

For a length function \(L\in {\mathcal L}_n\), let G L be the guessing function that guesses in the increasing order of L-lengths. Messages of the same L-length are ordered according to an arbitrary fixed rule, say lexicographical order on S n. Define a probability mass function on S n by

$$\displaystyle \begin{aligned} Q_L(x) = {\exp\{-L(x)\}\over\sum_{y\in S^n}\exp\{-L(y)\}}, \quad x\in S^n. \end{aligned} $$
(22.11)

Note that G L guesses in the decreasing order of Q L probabilities. In particular, \(G_L(x)\le \sum _{y\in S^n}{\mathbf {1}}[Q_L(y)\ge Q_L(x)] \le \sum _{y\in S^n}{Q_L(y)\over Q_L(x)} = {1\over Q_L(x)}\), so that

$$\displaystyle \begin{aligned} \ln G_L(x)\le -\ln Q_L(x) \quad x\in S^n. \end{aligned} $$
(22.12)

Also, by definition of Q L and using Kraft’s inequality (22.7),

$$\displaystyle \begin{aligned} {1\over Q_L(x)} = \exp\{L(x)\}\sum_{y\in S^n}\exp\{-L(y)\} \le \exp\{L(x)\},\end{aligned}$$

so that

$$\displaystyle \begin{aligned} -\ln Q_L(x)\le L(x), \quad x\in S^n. \end{aligned} $$
(22.13)

From these inequalities one deduces that for any B ≥ 1,

$$\displaystyle \begin{aligned} \{x:L_G(x)\ge B+1+\ln C\}\subset\{x:G(x)\ge e^B\}\subset\{x:L_G(x)\ge B\}, \end{aligned} $$
(22.14)

and

$$\displaystyle \begin{aligned} \{x:G_L(x)\ge e^B\}\subset \{x: L(x)\ge B\}. \end{aligned} $$
(22.15)

Now, by (22.12) and (22.13), the optimality of \(G^*\), (22.6), and the optimality of \(L^*\),

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} {\mathbb E}\exp\{\rho L(X^{(n)})\}& \ge&\displaystyle {\mathbb E}G_L(X^{(n)})^\rho\ge {\mathbb E}G^*(X^{(n)})^\rho\\ & \ge&\displaystyle {\mathbb E}\exp\{\rho L_{G^*}(X^{(n)})\}\exp\{-\rho(1+\ln C)\}\\ & \ge&\displaystyle {\mathbb E}\exp\{\rho L^*(X^{(n)})\}\exp\{-\rho(1+\ln C)\}. \end{array} \end{aligned} $$
(22.16)

Thus,

$$\displaystyle \begin{aligned} {{\mathbb E}G_L(X^{(n)})^\rho\over{\mathbb E}G^*(X^{(n)})^\rho} \le {{\mathbb E}\exp\{\rho L(X^{(n)})\}\over{\mathbb E}\exp\{\rho L^*(X^{(n)})\}} \exp\{\rho(1+\ln C)\}, \end{aligned} $$
(22.17)

and, in terms of the length function L G associated with G, one similarly has

$$\displaystyle \begin{aligned} {{\mathbb E}G(X^{(n)})^\rho\over{\mathbb E}G^*(X^{(n)})^\rho} \ge {{\mathbb E}\exp\{\rho L_G(X^{(n)})\}\over{\mathbb E}\exp\{\rho L^*(X^{(n)})\}} \exp\{-\rho(1+\ln C)\}. \end{aligned} $$
(22.18)

The lemma now follows from these bounds upon taking logarithms with \(L = L^*\) in (22.16). That is,

$$\displaystyle \begin{aligned} 1 \ge {{\mathbb E}G^*(X^{(n)})^\rho\over{\mathbb E}\exp\{\rho L^*(X^{(n)})\}} \ge \exp\{-\rho(1+\ln C)\}, \end{aligned} $$
(22.19)

so that \( 0\ge \ln {\mathbb E}G^*(X^{(n)})^\rho - \ln {\mathbb E}\exp\{\rho L^*(X^{(n)})\} \ge -\rho (1+\ln C)\); dividing by ρ yields the assertion.\(\hfill \blacksquare \)

The existence and determination of g(ρ) will ultimately follow from an application of Varadhan’s integral formula applied to a related function of X 1, …, X n obtained from the next three propositions and their lemmas.

Proposition 22.2

The guessing exponent g(ρ) exists if and only if

$$\displaystyle \begin{aligned} \ell(\rho) = \lim_{n\to\infty}\inf_{L\in{\mathcal L}_n}{1\over n}\ln{\mathbb E}\exp\{\rho L(X^{(n)})\} \end{aligned} $$
(22.20)

exists. Moreover, \(g(\rho) = \ell(\rho)\) when either limit exists.

Proof

Note that \(C_n\le 1+n\ln |S|\). Dividing both sides of the inequality in Lemma 1 by n, one has

$$\displaystyle \begin{aligned} \left|{1\over n\rho}\ln{\mathbb E}G^{*\rho}(X^{(n)})- {1\over n\rho}\ln{\mathbb E}(\exp\{\rho L^*(X^{(n)})\})\right|\le {1\over n}\left(1+\ln C_n\right) = O\left({\ln n\over n}\right). \end{aligned} $$
(22.21)

Thus the two sequences differ by o(1) as n → ∞.\(\hfill \blacksquare \)

The next proposition requires the Rényi entropy of order α ≠ 1 defined by

$$\displaystyle \begin{aligned} H_\alpha(p_{X^{(n)}}) = {1\over1-\alpha}\ln\sum_{{\mathbf x}\in S^n} p_{X^{(n)}}^{\alpha}({\mathbf x}) \equiv {1\over 1-\alpha}\ln{\mathbb E} p_{X^{(n)}}^{\alpha-1}(X^{(n)}). \end{aligned} $$
(22.22)
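A short numerical sketch of (22.22) for a single illustrative probability mass function, together with a check that \(H_\alpha\) approaches the Shannon entropy as α → 1 (cf. Exercise 1).

```python
# Rényi entropy of order alpha != 1 (in nats) for an illustrative pmf, and its
# behavior as alpha -> 1.
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha != 1, in nats."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.3, 0.2])
shannon = -np.sum(p * np.log(p))
for alpha in (0.5, 0.9, 0.99, 0.999):
    print(alpha, renyi_entropy(p, alpha), shannon)
```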
Proposition 22.3

\(\lim _{n\to \infty }\inf _{L\in {\mathcal L}_n}{1\over n}\ln {\mathbb E}\exp \{\rho L(X^{(n)})\}\), or equivalently \(\lim _{n\to\infty}{1\over n}\ln {\mathbb E}G^*(X^{(n)})^\rho \), exists if and only if \(\lim _{n\to \infty }{1\over n}H_\alpha (p_{X^{(n)}})\) exists for \(\alpha = {1\over 1+\rho }\). Moreover, if the latter limit exists, then it is given by \({g(\rho )\over \rho }\).

Proof

The equivalence is the content of Proposition 22.2. We focus on the former limit. For each n the Donsker–Varadhan variational formula of Corollary 21.17 yields, upon replacing g by L(X (n)), λ by ρ, Q by \(p_{X^{(n)}}\), and P by Q, that

$$\displaystyle \begin{aligned} \ln{\mathbb E}\exp\{\rho L(X^{(n)})\} = \sup_{Q\in{\mathcal P}_n}\{\rho{\mathbb E}_QL(X^{(n)}) - D(Q||p_{X^{(n)}})\}. \end{aligned} $$
(22.23)

Taking the infimum on both sides over all length functions \(L\in {\mathcal L}_n\) and applying Fan’s minimax exchange of supremum and infimum, one has

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \inf_{L\in{\mathcal L}_n}\ln{\mathbb E}\exp\{\rho L_n(X^{(n)})\} & =&\displaystyle \inf_{L\in{\mathcal L}_n}\sup_{Q\in{\mathcal P}_n}\{\rho{\mathbb E}_QL_n(X^{(n)}) - D(Q||p_{X^{(n)}})\}\\ & =&\displaystyle \sup_{Q\in{\mathcal P}_n}\inf_{L\in{\mathcal L}_n}\{\rho{\mathbb E}_QL_n(X^{(n)}) - D(Q||p_{X^{(n)}})\}\\ & =&\displaystyle \sup_{Q\in{\mathcal P_n}}\{\rho H(Q)-D(Q||p_{X^{(n)}})\} +O(1) \\ & =&\displaystyle \rho H_{1\over 1+\rho}(p_{X^{(n)}}) +O(1), \end{array} \end{aligned} $$
(22.24)

where, to justify the use of Fan’s minimax formula, one notes first that for each fixed \(L\in{\mathcal L}_n\) the map \(Q\in {\mathcal P}_n\mapsto \rho{\mathbb E}_Q L(X^{(n)}) - D(Q||p_{X^{(n)}}) = \sum _{x\in S^n}\{\rho L(x)-\ln Q(x) +\ln p_{X^{(n)}}(x)\}Q(x)\) is concave in Q, and that for each fixed Q it is linear as a function of L. The third equality follows from Theorem 22.1, namely \(\inf _{L\in {\mathcal L}_n}{\mathbb E}_{Q}L(X^{(n)}) = H(Q) +O(1)\), \(Q\in {\mathcal P}_n\). Finally, the last equality follows by writing

$$\displaystyle \begin{aligned} \sup_{Q\in{\mathcal P}_n}\left\{\rho H(Q)-D(Q||p_{X^{(n)}})\right\} = (1+\rho)\sup_{Q\in{\mathcal P}_n} \left\{{\mathbb E}_Q\left[-{\rho\over 1+\rho}\ln p_{X^{(n)}}(X^{(n)})\right]\! -\!D(Q||p_{X^{(n)}})\right\}\!, \end{aligned} $$

and then applying the Donsker–Varadhan variational formula of Corollary 21.17, as in the first equation, with g replaced by \(-\ln p_{X^{(n)}}(X^{(n)})\), λ replaced by \({\rho\over 1+\rho }\), Q replaced by \(p_{X^{(n)}}\), and P replaced by Q, to obtain the scaled Rényi entropy. That is,

$$\displaystyle \begin{aligned} \begin{array}{rcl} & &\displaystyle \sup_{Q\in{\mathcal P}_n}\{\rho H(Q)-D(Q||p_{X^{(n)}})\}\\ & &\displaystyle \quad =\sup_{Q\in{\mathcal P}_n}\left\{-\rho\sum_xQ(x)\ln Q(x) -\sum_xQ(x)\ln Q(x) + \sum_xQ(x)\ln p_{X^{(n)}}(x)\right\}\\ & &\displaystyle \quad = \sup_{Q\in{\mathcal P}_n}\left\{\sum_xQ(x)\ln p_{X^{(n)}}(x) -(1+\rho)\sum_xQ(x)\ln Q(x)\right\}\\ & &\displaystyle \quad = (1+\rho)\sup_{Q\in{\mathcal P}_n} \left\{{\mathbb E}_Q\left[-{\rho\over 1+\rho}\ln p_{X^{(n)}}(X^{(n)})\right] - D(Q||p_{X^{(n)}})\right\}\\ & &\displaystyle \quad = (1+\rho)\ln{\mathbb E}p_{X^{(n)}}^{{1\over 1+\rho}-1}(X^{(n)})\\ & &\displaystyle \quad = \rho H_{1\over 1+\rho}(p_{X^{(n)}}). \end{array} \end{aligned} $$

Scaling by \({1\over n}\) and letting n → ∞ completes the proof.\(\hfill \blacksquare \)
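The identity \(\sup_{Q\in{\mathcal P}_n}\{\rho H(Q)-D(Q||p_{X^{(n)}})\} = \rho H_{1\over 1+\rho}(p_{X^{(n)}})\) used in (22.24) can be checked numerically. In the sketch below the candidate maximizer \(Q\propto p^{1/(1+\rho)}\) is an assumption on our part (it can be derived by a Lagrange multiplier computation); random probability vectors serve as a sanity check that it is not beaten.

```python
# Numerical check of sup_Q { rho*H(Q) - D(Q||p) } = rho * H_{1/(1+rho)}(p).
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])      # illustrative pmf
rho = 1.5
alpha = 1.0 / (1.0 + rho)

def objective(Q):
    return rho * (-np.sum(Q * np.log(Q))) - np.sum(Q * np.log(Q / p))

Q_star = p ** alpha                # assumed maximizer, Q ~ p^{1/(1+rho)}
Q_star /= Q_star.sum()
target = rho * np.log(np.sum(p ** alpha)) / (1.0 - alpha)   # rho * H_alpha(p)
print(objective(Q_star), target)   # the two values should agree

for _ in range(1000):              # random pmfs should not do better
    Q = rng.dirichlet(np.ones(len(p)))
    assert objective(Q) <= objective(Q_star) + 1e-9
```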

The information spectrum is defined by \(-{1\over n}\ln p_{X^{(n)}}(X^{(n)})\). The next step is to show that the Rényi entropy rate can be computed from the distributions of the information spectra.

Theorem 22.4 (Hanawal and Sundaresan (2011))

Let ν n be the distribution of the information spectrum \(-{1\over n}\ln p_{X^{(n)}}(X^{(n)})\). If ν n, n ≥ 1, satisfy a LDP with rate function I, then the limiting Rényi entropy rate of order \(\alpha = {1\over 1+\rho }\) exists and is given by \(\beta ^{-1}\sup _{t\in {\mathbb R}}\{\beta t-I(t)\}\), where \(\beta ={\rho \over 1+\rho }\).

Proof

Let ν n denote the distribution of the information spectrum \(-{1\over n}\ln p_{X^{(n)}}(X^{(n)})\). Then, with \(A_n = \{-{1\over n}\ln p_{X^{(n)}}(x): x\in S^n\}\), one has

$$\displaystyle \begin{aligned} \begin{array}{rcl} \int_{\mathbb R}\exp(n\beta t)\nu_n(dt) & =&\displaystyle \sum_{t\in A_n}\exp(n\beta t)\sum_{\{x: p_{X^{(n)}}(x)= \exp(-nt)\}}p_{X^{(n)}}(x) \\ & =&\displaystyle \sum_{x\in S^n}p_{X^{(n)}}(x)^{1-\beta}\\ & =&\displaystyle \sum_{x\in S^n}p_{X^{(n)}}(x)^{1\over 1+\rho}\\ & =&\displaystyle \exp\{\beta H_{1\over 1+\rho}(p_{X^{(n)}})\}. \end{array} \end{aligned} $$
(22.25)

Now, scaling by \(1\over n\) and taking logarithms, one may apply the Varadhan integral formula to the left side to obtain \(\sup _{t\in {\mathbb R}}\{\beta t - I(t)\}\) in the limit, while the right side yields \(\beta \lim _n{1\over n}H_{1\over 1+\rho }(p_{X^{(n)}})\). Dividing by β completes the proof.\(\hfill \blacksquare \)

Corollary 22.5

If the distributions of the information spectra satisfy a LDP with rate I, then the guessing exponent exists and is given by

$$\displaystyle \begin{aligned}g(\rho) = (1+\rho)\sup_{t\in{\mathbb R}}\left\{{\rho\over 1+\rho}t -I(t)\right\}.\end{aligned}$$
Proof

By Theorem 22.4 the limiting Rényi entropy rate exists and equals \(\beta^{-1}\sup _{t\in {\mathbb R}}\{\beta t - I(t)\}\). Hence, by Proposition 22.3, the guessing exponent g(ρ) exists and \({g(\rho )\over \rho }\) equals this limit. Thus, one has \(g(\rho ) = \rho \beta ^{-1}\sup _{t\in {\mathbb R}}\{\beta t - I(t)\} = (1+\rho )\sup _{t\in {\mathbb R}}\{\beta t - I(t)\}\).\(\hfill \blacksquare \)

First let us apply this theory to the case of i.i.d. message sources.

Theorem 22.6 (I.I.D. Case)

Assume that \(X_1, X_2, \dots\) is i.i.d. with common probability mass function p on the finite alphabet S. Then the limit defining the guessing exponent g(ρ) exists and is given by \(g(\rho) = \rho H_{1\over 1+\rho}(p)\), where \(H_\alpha(p)\) denotes the Rényi entropy of order α of the probability mass function p.

Proof

From Proposition 22.3 one can compute g(ρ) from the Rényi entropy rate which, by Theorem 22.4, is given by \({1+\rho \over \rho }I^*({\rho \over 1+\rho })\), where I(⋅) is the large deviation rate function for the information spectrum

$$\displaystyle \begin{aligned}-{1\over n}\ln p_{X^{(n)}}(X^{(n)}) = -{1\over n}\ln\prod_{j=1}^np(X_j) = -{1\over n}\sum_{j=1}^n\ln p(X_j).\end{aligned}$$

In particular, \(I(h) = c^*(h)\) is the Legendre transform of the cumulant generating function of \(-\ln p(X_1)\), namely

$$\displaystyle \begin{aligned}c(h) = \ln{\mathbb E}e^{h(-\ln p(X_1))} = \ln{\mathbb E}p^{-h}(X_1) = hH_{1-h}(p).\end{aligned}$$

Since the Legendre transform operation ∗ is idempotent (see Exercise 6), it follows that

$$\displaystyle \begin{aligned}I^*\left({\rho\over1+\rho}\right) = (c^*)^*\left({\rho\over1+\rho}\right) = c\left({\rho\over1+\rho}\right).\end{aligned}$$

In particular, \(g(\rho) = (1+\rho){\rho\over 1+\rho}H_{1\over 1+\rho}(p) = \rho H_{1\over 1+\rho}(p)\), as asserted.\(\hfill \blacksquare \)
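The formula can be checked by brute force for small n: the sketch below computes \({1\over n}\ln{\mathbb E}G^*(X^{(n)})^\rho\) exactly by sorting all of \(S^n\) and compares it with \(\rho H_{1\over 1+\rho}(p)\). The source is an illustrative assumption, and the convergence is slow (the correction is roughly of order \(\ln n/n\)).

```python
# Exact computation of (1/n) ln E[G*(X^(n))^rho] for a small i.i.d. source,
# compared with the limit rho * H_{1/(1+rho)}(p); the pmf is illustrative.
import itertools
import numpy as np

p = np.array([0.6, 0.3, 0.1])
rho = 1.0
alpha = 1.0 / (1.0 + rho)
limit = rho * np.log(np.sum(p ** alpha)) / (1.0 - alpha)

for n in (2, 4, 6, 8):
    probs = np.array([np.prod(p[list(x)])
                      for x in itertools.product(range(len(p)), repeat=n)])
    probs = np.sort(probs)[::-1]                       # optimal guessing order
    ranks = np.arange(1, len(probs) + 1, dtype=float)  # G*(x) for each message
    print(n, np.log(np.sum(probs * ranks ** rho)) / n, limit)
```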

Theorem 22.7 (Irreducible Markov Case)

Let \(X_1, X_2, \dots\) be an irreducible Markov chain on S with homogeneous transition probability matrix \(p = ((p(y|x)))_{x,y\in S}\). Then the guessing exponent g(ρ) exists and is given by

$$\displaystyle \begin{aligned}g(\rho) = (1+\rho)\ln\lambda^+\left({\rho\over 1+\rho}\right),\end{aligned}$$

where \(\lambda^+(h)\) is the largest eigenvalue of the matrix \(((p^{1-h}(y|x)))_{x,y\in S}\).

Proof

As in the i.i.d. case, from Proposition 22.3 one can compute g(ρ) from the Rényi entropy rate which, by Theorem 22.4, is given by \({1+\rho \over \rho }I^*({\rho \over 1+\rho })\), where I(⋅) is the large deviation rate function for the information spectrum

$$\displaystyle \begin{aligned}-{1\over n}\ln p_{X^{(n)}}(X^{(n)}) = -{1\over n}\left\{\ln p(X_1)+\sum_{j=1}^{n-1}\ln p(X_{j+1}|X_j)\right\}.\end{aligned}$$

Note that Y j = (X j, X j+1), j = 1, 2, … is also a stationary Markov chain with one-step transition probabilities

$$\displaystyle \begin{aligned} \tilde{p}((w,z)|(x,y)) = \begin{cases} p(z|y),\quad y=w,\\ 0,\quad y\neq w. \end{cases} \end{aligned}$$

To compute I(⋅) it suffices to compute the large deviation rate for \(\sum _{j=1}^n v(Y_j)\), where \(v(x,y) = -\ln p(y|x),\ (x,y)\in S\times S\), so that \(v(Y_j) = -\ln p(X_{j+1}|X_j)\). The associated tilted operator \(T_h\), acting on functions f on S × S, is given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} T_hf(x,y) & =&\displaystyle \sum_{(w,z)\in S\times S}f(w,z)e^{-h\ln p(z|w)}\tilde{p}((w,z)|(x,y)) \\ & =&\displaystyle \sum_{z\in S}f(y,z)p^{-h}(z|y)p(z|y) = \sum_{z\in S}f(y,z)p^{1-h}(z|y). \end{array} \end{aligned} $$
(22.26)

Observe that \(T_hf(x, y) = \lambda f(x, y)\), (x, y) ∈ S × S, implies f(x, y) = f(y), i.e., f is constant in x. In particular, \(\lambda^+(h) = \lambda(1 - h)\), where λ(a) denotes the largest eigenvalue of the matrix \(((p^a(y|x)))_{(x,y)\in S\times S}\). The large deviation rate is therefore \(I(h) = (\ln\lambda^+)^*(h)\). Again using idempotency of the Legendre transform, \(I^*(t) = \ln\lambda^+(t)\). It follows that the Rényi entropy rate is given by \({1+\rho \over \rho }\ln \lambda^+({\rho \over 1+\rho })\), and therefore the guessing exponent is \(g(\rho ) = \rho {1+\rho \over \rho }\ln \lambda^+({\rho \over 1+\rho }) = (1+\rho)\ln\lambda^+({\rho\over 1+\rho})\), where \(\lambda^+({\rho\over 1+\rho})\) is the largest eigenvalue of the matrix \(((p^{1\over 1+\rho }(y|x)))_{(x,y)\in S\times S}\).\(\hfill \blacksquare \)
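The Markov chain formula can likewise be checked by brute force for a small chain. In the sketch below the transition matrix is an illustrative assumption, and the chain is started from its stationary distribution, in keeping with the standing stationarity assumption.

```python
# Markov case: compare (1+rho)*ln(lambda) -- lambda the largest eigenvalue of the
# matrix ((p(y|x)^{1/(1+rho)})) -- with a brute-force computation of
# (1/n) ln E[G*(X^(n))^rho] for small n. The transition matrix is illustrative.
import itertools
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # hypothetical transition matrix on S = {0, 1}
rho = 1.0

# Stationary distribution (left Perron eigenvector of P).
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

lam = np.max(np.real(np.linalg.eigvals(P ** (1.0 / (1.0 + rho)))))
limit = (1.0 + rho) * np.log(lam)

for n in (4, 8, 12):
    probs = []
    for x in itertools.product(range(2), repeat=n):
        pr = pi[x[0]]
        for a, b in zip(x[:-1], x[1:]):
            pr *= P[a, b]
        probs.append(pr)
    probs = np.sort(np.array(probs))[::-1]             # optimal guessing order
    ranks = np.arange(1, len(probs) + 1, dtype=float)
    print(n, np.log(np.sum(probs * ranks ** rho)) / n, limit)
```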

Remark 22.3

Alternative representations of the guessing exponent in both of these cases can be obtained by consideration of level-2 large deviations as given in Hanawal and Sundaresan (2011). Moreover, the computation of the guessing exponent by these methods for other general classes of message sources can be found there.

The Kraft inequality for lengths plays an essential role in this application, specifically in Theorem 22.1 and its use in the proof of Proposition 22.3. In the classic work of Shannon (1948), messages are defined as sequences from a finite alphabet S, referred to as ciphers. In the context of message compression, for a positive integer b one often defines a b-ary coding function as an injective map \(c: S^n\to \cup _{m=1}^\infty \{0,1,\dots , b-1\}^m\) that renders a message x ∈ S^n of length n as a b-ary sequence c(x) of length m for some m. One seeks codes c for which the average length \({\mathbb E}L(X^{(n)})\) of a message X^{(n)} is minimal, where L(x) denotes the length of c(x). A b-ary coding function is said to be prefix-free (or instantaneous) if for x ≠ y, c(x) is not a prefix of c(y). A prefix-free code may be represented by leaves on a rooted b-ary tree obtained by coding the path from the root to the leaves (terminal vertices) with labels {0, 1, 2, …, b − 1} from left to right at each level of the tree. Therefore, a prefix-free codeword can be decoded instantaneously, without reference to future codewords, since the end of a codeword is immediately recognizable as a leaf.

Proposition 22.8

Given any positive integers \(L_1,\dots , L_{|S|^n}\) satisfying the Kraft inequality, there is a prefix-free b-ary code on \(S^n\), b ≥ 3, whose codewords have lengths \(L_1,\dots , L_{|S|^{n}}\).

Proof

Observe that for positive integers L(x), x ∈ S^n, and b ≥ 3, since 2 < e < 3, one has \(\sum _{x\in S^n}b^{-L(x)}\le 1\) whenever \(\sum _{x\in S^n}\exp \{-L(x)\}\le 1\). Let m = |S|^n, \(L_{\max } = \max \{L_1,\dots ,L_m\}\), and construct a full rooted b-ary tree of height \(L_{\max }\) for a b ≥ 3. The total number of leaves available is \(b^{L_{\max }}\); the words of length \(L_{\max }\) are assigned leaves at height \(L_{\max }\) whose final label belongs to {0, 1, …, b − 2} (see Figure 22.1). This uses at most \((b-1)b^{L_{\max }-1}\) of the leaves, with \(b^{L_{\max }}-(b-1)b^{L_{\max }-1} = b^{L_{\max }-1}\) remaining for coding words of length at most \(L_{\max }-1\). Proceed inductively.\(\hfill \blacksquare \)

Fig. 22.1 Prefix-free code: S = {α, β}, n = 3, b = 3; L(α, β, β) = 1, L(α, α, α) = L(α, α, β) = 2; L(β, β, β) = ⋯ = L(β, α, α) = 3.

Remark 22.4

The prefix-free b-ary code constructed in the proof of Proposition 22.8 is referred to as the Shannon code. The units of message compression are referred to as “bits” when the logarithm is base 2, and “nats” for the natural logarithm. Natural logarithms are mathematically more convenient for the problem at hand and can be used without loss of generality.

The significance of Proposition 22.8 for the present chapter is that one may assume any given lengths subject to the Kraft inequality to be those of a prefix-free code.
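The following sketch implements the standard interval-based construction of a prefix-free b-ary code from lengths satisfying the base-b Kraft inequality; it is a variant of, rather than a transcription of, the inductive argument in the proof of Proposition 22.8, and the example lengths are illustrative.

```python
# Build a prefix-free b-ary code from lengths satisfying the base-b Kraft inequality:
# sort the lengths, and give the i-th word the first L_i b-ary digits of the left
# endpoint of the next free subinterval of [0, 1). Exact arithmetic avoids rounding.
from fractions import Fraction

def prefix_free_code(lengths, b=3):
    """Return b-ary codewords (tuples of digits) with the prescribed lengths."""
    assert sum(Fraction(1, b ** L) for L in lengths) <= 1, "Kraft inequality violated"
    codewords = []
    next_value = Fraction(0)      # left endpoint of the next free interval in [0, 1)
    for L in sorted(lengths):
        v = next_value
        digits = []
        for _ in range(L):
            v *= b
            d = int(v)            # integer part = next b-ary digit
            digits.append(d)
            v -= d
        codewords.append(tuple(digits))
        next_value += Fraction(1, b ** L)
    return codewords

print(prefix_free_code([1, 2, 2, 3, 3, 3, 3], b=3))
```

No codeword can be a prefix of a later one, since each later word’s interval of width \(b^{-L}\) lies entirely to the right of every earlier word’s interval.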

Remark 22.5

The approximate code lengths \(L(x) = \ln {1\over p_{X^{(n)}}(x)}\) can be obtained as the solution to minimizing the expected code length subject to the Kraft inequality by the method of Lagrange multipliers (Exercise 5). However, as noted, these are not necessarily positive integers. The existence of an optimal code is a consequence of Theorem 22.1 and Proposition 22.8 by consideration of the prefix-free code associated with \(L(x) = \lceil \ln {1\over p_{X^{(n)}}(x)}\rceil \).

The next example illustrates a role for large deviation theory in large sample statistical inference.

Example 2 (Efficiency of Statistical Tests of Hypothesis in Large Samples)

A common statistical test of hypothesis about an unknown parameter θ based on a random sample of size n from some distribution may be stated as follows:

The null hypothesis H 0 : θ ≤ θ 0 is to be tested against the alternative hypothesis H 1 : θ > θ 0. The test is of the form: Reject H 0 (in favor of H 1) if \(\overline {X} > a\), where \(\overline {X}\) is the (sample) mean of i.i.d. variables (X 1, …, X n) based on the random sample, and a is an appropriate number. The objective is to have small error probabilities \(\alpha _n= P(\overline {X} > a \vert H_0)\), and \(\beta _n= P(\overline {X} \le a \vert H_1)\).

There are several competing notions for the Asymptotic Relative Efficiency (ARE) of such tests. For example, in the so-called location problem, the distribution function of X is of the form F(x − θ), \(\theta \in {\mathbb R}\). In particular, F may be the normal distribution N(θ, 1). The Normal test M is of the form: Reject H 0 iff \(\overline {X} >a\). The t-test T is of the form: Reject H 0 iff \(\overline {X}/s > a\), where s is the sample standard deviation. The Sign test S is of the form: Reject H 0 iff \({1\over n}\sum _{1\le i\le n}[X_i - \theta _0\ge 0] > a\). The a-values of these tests are not necessarily the same.

The most commonly used ARE is the Pitman ARE \(E_P\), which fixes a “small” level α_n = α and compares two tests A, B, say, based on the smallness of their β_n. Specifically, the Pitman ARE of B with respect to A is \(E_P(B, A) = \lim_{n\to\infty} n/h(n)\), where h(n) is the sample size needed for B to attain the same level β_n as attained by A based on a sample size n. The asymptotics here are generally based on weak convergence, especially the CLT (central limit theorem).

The two other important AREs we discuss in detail here are mainly based on large deviations. Chernoff-ARE: Based on large deviation estimates for each test A, B, Chernoff’s (modified) test picks the value of a that minimizes α_n + λβ_n over all a, for some fixed λ > 0. (It turns out that the ARE does not depend on λ.) The large deviation rates I(A), I(B) of decay of this minimum value δ_n, say, of α_n + λβ_n are compared for the tests A and B, and the Chernoff ARE of B with respect to A is \(E_C(B, A) = I(B)/I(A)\).

Proposition 22.9

Assume \(m_i(h)= {\mathbb E}(\exp \{hX_1\}\vert H_i) < \infty \ \ \text{for all} \ -\infty <h<\infty (i=0,1)\). Let

$$\displaystyle \begin{gathered} c_i(a)= \sup\{ah-\ln m_i(h):h \in{\mathbb R}\} (i=0,1), d(a)= \min\{c_0(a), c_1(a)\},\\ I= \max\{d(a): \theta_0\le a \le \theta_1\}, \end{gathered} $$

and \(\rho =\exp \{- I\}\). Then

$$\displaystyle \begin{aligned} \lim_{n\to \infty}{1\over n}\ln\delta_n = -I. \end{aligned} $$
(22.27)

Proof

By the upper bound in the Cramér–Chernoff theorem (i.e., Chernoff’s Inequality), \(\alpha _n + \lambda \beta _n \le \exp \{ -nc_0(a)\} + \lambda \exp \{ -nc_1(a)\}\le (1+\lambda )\exp \{ - nd(a)\}\). Minimizing over a, one arrives at the inequality δ n ≤ (1 + λ)ρ n, or \({1\over n}\ln \delta _n \le -I +{1\over n}\ln (1+\lambda )\), so that \(\limsup _n{1\over n}\ln \delta _n \le -I\). For the lower bound on δ n, note that, by the Cramér–Chernoff theorem, \(\liminf_n {1\over n}\ln \alpha _n\ge -c_0(a)\), \(\liminf _n{1\over n}\ln \beta _n\ge -c_1(a)\). That is, given η > 0, for all sufficiently large n, \(\max \{ \alpha _n , \beta _n\} \ge \exp \{-n(d(a) +\eta )\}\), so that \(\alpha _n +\lambda \beta _n \ge \min\{1,\lambda\} \exp \{-n(d(a)+\eta )\}\ge \min\{1,\lambda\}\exp\{-n(I+\eta)\}\). Hence, taking the minimum over a, \(\delta _n\ge \min\{1,\lambda\}\exp \{-n(I+\eta )\}\), or \({1\over n}\ln \delta _n\ge -(I+\eta ) + {1\over n}\ln \min\{1,\lambda\}\); so that \(\liminf_n {1\over n}\ln \delta _n\ge -(I+\eta ) \ \ \text{for all} \ \ \eta >0\). Hence \(\liminf _n{1\over n}\ln \delta _n\ge -I\).\(\hfill \blacksquare \)

The Location Problem

Consider the tests M, T, S for the location problem for F(x − θ) described in the first paragraph. Assume that F has a density f, continuous at 0, and a finite variance \(\sigma ^2_f\). Then for the test of H_0 : θ ≤ 0 against the alternative hypothesis H_1 : θ ≥ θ_1 > 0, one can show that \(E_P(S,M)=4 \sigma ^2_ff^2(0)\). In particular, (i) if the observations are N(θ, 1), then E_P(S, M) = 2∕π < 1, (ii) if they are double exponential (i.e., with density \(f(x-\theta)= {1\over 2}\exp \{-|x-\theta |\}\)), then E_P(S, M) = 2, and (iii) if they are uniform on \([\theta-{1\over 2}, \theta+{1\over 2}]\), then E_P(S, M) = 1∕3. In all these cases (and more broadly) E_P(T, M) = 1, where T is the t-test.
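The three values of \(E_P(S,M) = 4\sigma_f^2f^2(0)\) quoted above are quick to verify numerically by plugging in the variance and the density at 0 for each family.

```python
# E_P(S, M) = 4 * variance * f(0)^2 for the three location families above.
import math

cases = {
    "normal":             (1.0,        1.0 / math.sqrt(2 * math.pi)),  # (variance, f(0))
    "double exponential": (2.0,        0.5),
    "uniform":            (1.0 / 12.0, 1.0),
}
for name, (var, f0) in cases.items():
    print(name, 4 * var * f0 ** 2)
# Prints approximately 2/pi, 2, and 1/3, respectively.
```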

More interesting are Pitman comparisons among nonparametric tests for the so-called two-sample problems. Here two independent samples (X_1, …, X_m) and (Y_1, …, Y_n) of sizes m and n are drawn from distributions with densities of the form f(x∕σ) and f((x − θ)∕σ), respectively, \(\theta \in {\mathbb R}\), σ > 0. One wishes to test H_0 : θ = 0 against H_1 : θ > 0. More generally, one wishes to test whether the Y-distribution is stochastically larger than the X-distribution (i.e., P(Y > z) ≥ P(X > z) for all z, with strict inequality for at least some z). The most commonly used test for this uses the (nonparametric) statistic \(T =\overline {Y} - \overline {X}\), which rejects H_0 if T exceeds a critical value. (The critical value is determined approximately by the CLT to meet the requirement α = P(Reject H_0 | H_0).) It turns out that appropriate nonparametric tests based on the ranks of the combined observations X_i, Y_j mostly outperform T.

Chernoff Index Computation

The Chernoff indices I are generally difficult to compute, since they involve minimizing a linear combination of both error probabilities α_n and β_n. We consider the simple case where F = N(θ, 1), and H_0 : θ = 0, H_1 : θ = θ_1, θ_1 > 0. Again consider the test M: Reject H_0 if \(\overline {X} > a\); otherwise reject H_1. We leave it as an exercise to show that \(I(M) =\theta _1^2/8\) (Exercise 8). For the sign test S: Reject H_0 if \({1\over n}\sum _{1\le i\le n}{\mathbf {1}}[X_i >0] > a\) for some appropriate a, as considered in the discussion of the Chernoff-ARE above, one may compute the Chernoff index I(S) from the binomial distribution B(n, p) with p = Φ(θ_1), Φ being the distribution function of N(0, 1). Namely,

$$\displaystyle \begin{aligned} I(S) = \ln\{2(b(\theta_1))^{b(\theta_1)}(1 - b(\theta_1))^{1-b(\theta_1)}\}, \end{aligned} $$
(22.28)

where \(b(\theta ) = \ln [2(1 - \varPhi (\theta ))]/\ln \{(1-\varPhi (\theta ))/\varPhi (\theta )\}\) (Exercise 10). The ratio I(S)∕I(M) provides the Chernoff-ARE E_C(S, M). One may check that \(E_C(S, M) \to 2/\pi = E_P(S, M)\) as \(\theta_1 \downarrow 0\) (Exercise 11).
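A numerical sketch of the comparison: I(M) uses the closed form \(\theta_1^2/8\), while I(S) is computed directly by maximizing \(d(a)=\min\{c_0(a),c_1(a)\}\) over a grid, with \(c_0, c_1\) the Cramér–Chernoff rates of the Bernoulli proportion under the two hypotheses; the grid size and the values of \(\theta_1\) are illustrative.

```python
# Chernoff indices for N(0,1) vs N(theta1,1): I(M) = theta1^2/8 for the mean test, and
# I(S) for the sign test by maximizing min{D(a||1/2), D(a||Phi(theta1))} over a grid.
# The ratio I(S)/I(M) should approach 2/pi as theta1 decreases.
import math
import numpy as np

def kl_bernoulli(a, q):
    return a * math.log(a / q) + (1 - a) * math.log((1 - a) / (1 - q))

def chernoff_indices(theta1, grid=10000):
    I_M = theta1 ** 2 / 8
    p = 0.5 * (1 + math.erf(theta1 / math.sqrt(2)))          # Phi(theta1)
    a_vals = np.linspace(0.5 + 1e-6, p - 1e-6, grid)
    I_S = max(min(kl_bernoulli(a, 0.5), kl_bernoulli(a, p)) for a in a_vals)
    return I_M, I_S

for theta1 in (1.0, 0.5, 0.1):
    I_M, I_S = chernoff_indices(theta1)
    print(theta1, I_S / I_M, 2 / math.pi)
```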

Bahadur-ARE

As mentioned above, the Chernoff-ARE is generally difficult to compute. In addition, the threshold of the test itself is modified by the requirement of this notion of efficiency. The most popular ARE for tests based on large deviations is due to Bahadur (1960). Here is a brief description following Serfling (1980). The Bahadur-ARE is based on a large deviation rate comparison of the p-values of the tests. Consider a test of the hypothesis H_0 : θ ∈ Θ_0, with a real-valued test statistic T_n based on observations X_1, …, X_n, rejecting H_0 if T_n is large. The p-value of the test is \(L_n= \sup [1-F_{\theta_n}(T_n): \theta \in \varTheta_0]=1- F_{\theta^{(0)}_n}(T_n)\), say, where \(F_{\theta_n}\) is the distribution function of the statistic T_n under the parameter value θ. Thus L_n is the random quantity which is the probability (under H_0) of the statistic being larger than what is observed, i.e., of showing a discrepancy from the null hypothesis as large as or larger than what is observed. Statisticians routinely use L_n to decide whether to reject H_0: the smaller the p-value, the stronger the evidence against H_0. Under H_0, assuming that the distribution function \(F_{\theta^{(0)}_n}\) of T_n is continuous, \(F_{\theta^{(0)}_n}(T_n)\) has the uniform distribution on [0, 1], and hence so does \(L_n = 1-F_{\theta^{(0)}_n}(T_n)\). Under fairly general conditions, \(- 2n^{-1}\ln L_n\) converges almost surely, for θ ∈ Θ_1, to a constant c(θ), which is referred to as Bahadur’s (exact) slope for T_n. The Bahadur relative efficiency of a test I with respect to a test II is defined by the ratio of their corresponding slopes (a.s. large deviation rates), \(e_B(I, II) = c_I(\theta )/c_{II}(\theta )\).
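A small simulation sketch for the mean test in the N(θ, 1) model: for θ > 0 one expects \(-2n^{-1}\ln L_n\) to approach the slope \(c_M(\theta)=\theta^2\) (see Theorem 22.10 and the computation following it). The parameter value and sample sizes are illustrative, and the exact normal tail is used for the p-value.

```python
# Simulated Bahadur slope for the mean test: -2/n * ln L_n versus theta^2.
import math
import numpy as np

rng = np.random.default_rng(1)
theta = 0.5
for n in (100, 1000, 3000):
    x = rng.normal(loc=theta, scale=1.0, size=n)
    T = math.sqrt(n) * x.mean()                        # test statistic
    # p-value L_n = 1 - Phi(T), computed on the log scale for numerical stability.
    log_pvalue = math.log(0.5 * math.erfc(T / math.sqrt(2)))
    print(n, -2 * log_pvalue / n, theta ** 2)
```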

The following is a basic result which may be used to compute the slope of tests such as H_0 : θ ≤ θ_0 against H_1 : θ > θ_0. Write \(\varTheta _1 = \varTheta\setminus\varTheta _0\).

Theorem 22.10 (Bahadur (1960))

For a test sequence T n which rejects H 0 for large T n, assume (i) \(n^{-{1\over 2}}T_n\) converges a.s. (under θ) to a finite b(θ), for all θ ∈ Θ 1, and (ii) one has

$$\displaystyle \begin{aligned} \lim_{n\to\infty} -2n^{-1}\ln\sup[1-F_{\theta_n}(n^{1\over 2}t) : \theta\in\varTheta_0]= g(t),\end{aligned} $$
(22.29)

where g is continuous on an open interval I containing {b(θ) : θ ∈ Θ 1}. Then ∀ θ ∈ Θ 1, with P θ-probability one,

$$\displaystyle \begin{aligned} \lim_{n\to\infty}-2n^{-1}\ln L_n= g(b(\theta )) = c(\theta ).\end{aligned} $$
(22.30)

Proof

Fix a θ ∈ Θ_1, and let ω be any point in the sample space of P_θ for which the limit (i) holds. Fix 𝜖 > 0 sufficiently small that (b(θ) − 𝜖, b(θ) + 𝜖) is contained in I. By (i), there exists n(ω) such that \(b(\theta )-\epsilon \le n^{-{1\over 2}}T_n(\omega )\le b(\theta ) +\epsilon \) for all n ≥ n(ω), i.e., \(n^{1\over 2}(b(\theta )-\epsilon )\le T_{n}(\omega)\le n^{1\over 2}(b(\theta )+\epsilon )\) for all n ≥ n(ω). Plugging these bounds into \(L_n(\omega) = \sup [ 1- F_{\theta _n}(T_n(\omega)): \theta \in \varTheta _0]\) and using the monotonicity of the distribution functions, one then has

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} & &\displaystyle -2n^{-1}\ln \sup[ 1- F_{\theta_n}(n^{1\over 2}(b(\theta )- \epsilon)): \theta\in\varTheta_0] \\ & &\displaystyle \quad \le - 2n^{-1}\ln L_n(\omega) \\ & &\displaystyle \quad \le -2n^{-1}\ln\sup[ 1-F_{\theta_n}(n^{1\over 2}(b(\theta )+\epsilon)): \theta \in\varTheta_0] \quad \forall\ n\ge n(\omega). \end{array} \end{aligned} $$
(22.31)

The limits as n → ∞ of the two extreme sides are, by (ii), g(b(θ) − 𝜖) and g(b(θ) + 𝜖). Therefore, the limit points of the middle term in (22.31) all lie between these two values. Letting 𝜖 ↓ 0 and using the continuity of g, it follows that the middle term converges to g(b(θ)).\(\hfill \blacksquare \)

The exact Bahadur slopes for the mean test M and the t-test T may be computed for testing H_0 : θ ≤ 0 versus the alternative H_1 : θ > 0 in the model N(θ, 1), using the upper tail of the standard normal N(0, 1) and that of the (Student’s) t-statistic with n − 1 degrees of freedom. Using Bahadur’s theorem, one finds \(c_M(\theta) = \theta^2\), \(c_T(\theta )=\ln (1+\theta ^2)\) (Exercise 12). Thus e_B(T, M) < 1 for all θ ∈ Θ_1. This is in contrast with both Pitman’s ARE and Chernoff’s ARE, for each of which the ARE is one.

Remark 22.6

Bahadur’s ARE also distinguishes between the frequency chi-square and the likelihood ratio test in the multinomial model, showing that the latter is asymptotically more efficient than the former. Again, the Pitman ARE between the two tests is one.

Exercises

  1. (Shannon & Rényi Entropies) Show that the Shannon entropy H may be expressed in terms of the Rényi entropy as

     $$\displaystyle \begin{aligned}H(X_1,\dots,X_n)=\lim_{\alpha\to 1}H_\alpha(X_1,\dots,X_n).\end{aligned}$$

  2. Complete the proof of (22.8) by showing that for random vectors \(H(X, Y) \le H(X) + H(Y)\). [Hint: Show how to express \(H(X) + H(Y) - H(X, Y)\) as \(D(p_{(X,Y)}||p_X \odot p_Y)\ge 0\), where \(p_{(X,Y)}, p_X, p_Y\) are the joint and marginal distributions, respectively.]

  3. Let P, Q be probability measures on \(S^n\). For convenience relabel \(S^n = \{1, \dots, k\}\), \(k = |S|^n\), \(q_j = Q(\{j\})\), \(p_j = P(\{j\})\). Prove the Gibbs inequality: \(\sum _j p_j\ln p_j \ge \sum _j p_j\ln q_j\). [Hint: Consider \(\sum _jp_j\ln {q_j\over p_j}\) and use the bound \(\ln x\le x-1, x>0\).]

  4. Give a proof of the Kraft inequality (22.7) for the message length function \(L_G\) associated with G. [Hint: \(\sum _{x}e^{-\lceil -\ln Q_G(x) \rceil } \le \sum _{x}e^{\ln Q_G(x)} = 1\).]

  5. Show that the problem \(\min _{L\in {\mathcal L}_n:\sum _xe^{-L(x)}\le 1} \sum _{x\in S^n}p_{X^{(n)}}(x)L(x)\) has a solution. [Hint: Use Lagrange multipliers to minimize \(J = \sum _{x\in S^n}p_{X^{(n)}}(x)L(x) + \lambda \sum _{x\in S^n}e^{-L(x)}\). Set the derivative with respect to each L(x) equal to zero; λ can then be determined from the constraint \(\sum_x e^{-L(x)} \le 1\).]

  6. Let u, v be twice-continuously differentiable convex functions on \({\mathbb R}\) with Legendre transforms \(u^*, v^*\), respectively, where \(f^*(x) = \sup _{h\in {\mathbb R}}\{xh-f(h)\}\), \(x\in {\mathbb R}\). Show that (a) \(u^*\) is convex, and (b) (Idempotency) \(u^{**} = u\). [Hint: Write \(u^*(x) = xh(x) - u(h(x))\) and use the smoothness hypothesis on u to optimize.]

  7. Give a proof of (22.6).

  8. Show that \(I(M) =\theta _1^2/8\).

  9. Give a proof of (22.6).

  10. Verify the hypotheses (i), (ii) in Theorem 22.10 for the tests M, T, S. [Hint: For M, let \(T_n = n^{1\over 2}(\overline {X}-\theta _0)\); then (i) is satisfied, since for \(\theta >\theta _0\), \(n^{-{1\over 2}}T_n \to \theta -\theta _0\). For assumption (ii), assume that \(X_j\) has a finite moment generating function, and use the Cramér–Chernoff large deviation rate. A similar, but slightly longer, proof applies to the statistic T, using the independence of the sample mean and sample variance. For S one uses the moment generating function of Bernoulli variables.]

  11. Show that \(E_C(S, M) \to 2/\pi = E_P(S, M)\) as \(\theta_1\downarrow 0\).

  12. Show, using Bahadur’s theorem, that \(c_M(\theta) = \theta^2\), \(c_T(\theta )=\ln (1+\theta ^2)\).