14.1 The Definitions and Properties of Information and Entropy

Suppose one conducts an experiment whose outcome is not predetermined. The term “experiment” will have a broad meaning. It may be a test of a new device, a satellite launch, a football match, a referendum and so on. If, in a football match, the first team is stronger than the second, then the occurrence of the event A that the first team won carries little significant information. On the contrary, the occurrence of the complementary event \(\overline{A}\) contains a lot of information. The event B that a leading player of the first team was injured does contain information concerning the event A. But if it was the first team’s doctor who was injured then that would hardly affect the match outcome, so such an event B carries no significant information about the event A.

The following quantitative measure of information is conventionally adopted. Let A and B be events from some probability space \(\langle\varOmega, \mathfrak{F}, \mathbf{P}\rangle\).

Definition 14.1.1

The amount of information about the event A contained in the event (message) B is the quantity

$$I(A | B):= \log\frac{\mathbf{P}(A | B)}{\mathbf{P}(A)} . $$

The occurrence of the event B=A may be interpreted as the message that A took place.

Definition 14.1.2

The number I(A):=I(A|A) is called the amount of information contained in the message A:

$$I(A):= I(A | A) = -\log\mathbf{P}(A) . $$

We see from this definition that the larger the probability of the event A, the smaller I(A). As a rule, the logarithm to the base 2 is used in the definition of information. Thus, say, the message that a boy (or girl) was born in a family carries a unit of information (it is supposed that these events are equiprobable, and \(-\log_2 p = 1\) for p=1/2). Throughout this chapter, we will write just log x for \(\log_2 x\).

If the events A and B are independent, then I(A|B)=0. This means that the event B does not carry any information about A, and vice versa. It is worth noting that we always have

$$I(A | B) = I(B | A). $$

It is easy to see that if the events A and B are independent, then

$$ I(AB) = I(A) + I(B) . $$
(14.1.1)

Consider an example. Let a chessman be placed at random on one of the squares of a chessboard. The information that the chessman is on square number k (the event A) is equal to I(A)=log64=6. Let B 1 be the event that the chessman is in the i-th row, and B 2 that the chessman is in the j-th column. The message A can be transmitted by transmitting B 1 first and then B 2. We have

$$I(B_1) = \log8 = 3 = I(B_2 ) . $$

Therefore

$$I(B_1) + I(B_2) = 6 = I(A) , $$

so that transmitting the message A “by parts” requires communicating the same amount of information (which is equal to 6) as transmitting A itself. One could give other examples showing that the introduced numerical characteristics are quite natural.
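These quantities are easy to verify numerically. Below is a minimal Python sketch of Definitions 14.1.1 and 14.1.2 applied to the chessboard example; the function name info and the variable names are ours, introduced only for illustration.

```python
from math import log2

def info(p_a, p_a_given_b=None):
    """I(A) = -log2 P(A); with a second argument, I(A|B) = log2(P(A|B)/P(A))."""
    if p_a_given_b is None:
        return -log2(p_a)
    return log2(p_a_given_b / p_a)

# Chessboard example: a piece is placed uniformly on one of the 64 squares.
p_A  = 1 / 64     # A:  the piece is on a fixed square k
p_B1 = 1 / 8      # B1: the piece is in the i-th row
p_B2 = 1 / 8      # B2: the piece is in the j-th column

print(info(p_A))                    # I(A) = 6.0
print(info(p_B1) + info(p_B2))      # I(B1) + I(B2) = 3.0 + 3.0 = 6.0
print(info(p_A, 1 / 8))             # I(A | B1) = log2((1/8)/(1/64)) = 3.0
```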

Let G be an experiment with outcomes E 1,…,E N occurring with probabilities p 1,…,p N .

The information resulting from the experiment G is a random variable J G =J G (ω) assuming the value −logp j on the set E j , j=1,…,N.

Thus, if in the probability space \(\langle\varOmega, \mathfrak{F}, \mathbf{P}\rangle\) corresponding to the experiment G, Ω coincides with the set (E 1,…,E N ), then J G (ω)=I(ω).

Definition 14.1.3

The expectation of the information obtained in the experiment G, E J G =−∑p j logp j , is called the entropy of the experiment. We shall denote it by

$$H_{\mathbf{p}} = H(G) := - \sum_{j=1}^N p_j \log p_j , $$

where p=(p 1,…,p N ). For p j =0, by continuity we set p j logp j to be equal to zero.

The entropy of an experiment is, in a sense, a measure of its uncertainty. Let, for example, our experiment have two outcomes A and B with probabilities p and q=1−p, respectively. The entropy of the experiment is equal to

$$H_{\mathbf{p}} = - p \log p - (1 - p) \log(1 - p) = f(p) . $$

The graph of this function is depicted in Fig. 14.1.

Fig. 14.1 The plot of the entropy f(p) of a random experiment with two outcomes

The only maximum of f(p) equals log2=1 and is attained at the point p=1/2. This is the case of maximum uncertainty. As p moves away from 1/2, the uncertainty decreases together with H p , and H p =0 for p=(0,1) or (1,0).
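Definition 14.1.3 and the two-outcome case are easy to tabulate. A short Python sketch (the function entropy is an illustrative name, not part of the text):

```python
from math import log2

def entropy(p):
    """H(p) = -sum_j p_j log2 p_j, with 0*log 0 taken to be 0 (Definition 14.1.3)."""
    return -sum(pj * log2(pj) for pj in p if pj > 0)

# Two-outcome experiment: f(p) = -p log p - (1 - p) log(1 - p)
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
    print(p, round(entropy((p, 1 - p)), 4))
# The maximum, log 2 = 1, is attained at p = 1/2; H = 0 at p = 0 and at p = 1.
```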

The same properties can easily be seen in the general case as well.

The properties of entropy.

  1.

    H(G)=0 if and only if there exists a j, 1≤j≤N, such that p j =P(E j )=1.

  2.

    H(G) attains its maximum when p j =1/N for all j.

Proof

The second derivative of the function β(x)=xlogx is positive on [0,1], so that β(x) is convex. Therefore, for any q i ≥0 such that \(\sum_{i=1}^{N} q_{i} = 1\), and any x i ≥0, one has the inequality

$$\beta \Biggl(\,\sum_{i=1}^N q_i x_i \Biggr) \le\sum_{i=1}^N q_i \beta (x_i) . $$

If we take q i =1/N, x i =p i , then

$$\Biggl(\frac{1}{N} \sum_{i=1}^N p_i \Biggr) \log \Biggl(\frac{1}{N} \sum _{i=1}^N p_i \Biggr) \le\sum _{i=1}^N \frac{1}{N} p_i \log p_i . $$

Setting \(\mathbf{u} := (\frac{1}{N}, \ldots, \frac{1}{N})\) we obtain from this that

$$\begin{aligned} - \log\frac{1}{N} = \log N = H_{\mathbf{u}} \ge- \sum _{i=1}^N p_i \log p_i = H_{\mathbf{p}} . \end{aligned}$$

 □

Note that if the entropy H(G) equals its maximum value H(G)=logN, then J G (ω)=logN with probability 1, i.e. the information J G (ω) becomes constant.
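Properties 1 and 2 can also be checked empirically: only degenerate distributions give H=0, and no distribution on N points exceeds log N. A sketch in which the test distributions are generated at random (an assumption made only for illustration):

```python
import random
from math import log2

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

N = 8
print(entropy([1.0] + [0.0] * (N - 1)))   # degenerate distribution: H = 0 (Property 1)
print(entropy([1 / N] * N))               # uniform distribution: H = log N = 3.0

# Property 2: randomly generated distributions never exceed log N.
for _ in range(1000):
    w = [random.random() for _ in range(N)]
    s = sum(w)
    p = [x / s for x in w]
    assert entropy(p) <= log2(N) + 1e-9
```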

  3.

    Let G 1 and G 2 be two independent experiments. We write down the outcomes and their probabilities in these experiments in the following way:

    $$G_1 = \left ( \begin{array}{c} E_1 , \ldots, E_N \\p_1 , \ldots, p_N \end{array} \right ) , \qquad G_2 = \left ( \begin{array}{c} A_1 , \ldots, A_M \\q_1 , \ldots, q_M \end{array} \right ) . $$

Combining the outcomes of these two experiments we obtain a new experiment

$$G = G_1 \times G_2 = \left ( \begin{array}{c} E_1 A_1 , E_1 A_2 ,\ldots , E_N A_M \\p_1 q_1 , p_1 q_2 ,\ldots , p_N q_M \end{array} \right ) . $$

The information J G obtained as a result of this experiment is a random variable taking values −logp i q j with probabilities p i q j , i=1,…,N; j=1,…,M. But the sum \(J_{G_{1} } + J_{G_{2}}\) of two independent random variables equal to the amounts of information obtained in the experiments G 1 and G 2, respectively, clearly has the same distribution. Thus the information obtained in a sequence of independent experiments is equal to the sum of the information from these experiments. Since in that case clearly

$$\mathbf{E}J_G = \mathbf{E}J_{G_1 } + \mathbf{E}J_{G_2 } , $$

we have that for independent G 1 and G 2 the entropy of the experiment G is equal to the sum of the entropies of the experiments G 1 and G 2:

$$H(G) = H(G_1) + H(G_2) . $$
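A numerical illustration of Property 3 with two arbitrary distributions (the particular numbers are assumptions): the entropy of the product experiment equals the sum of the entropies.

```python
from math import log2

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

p = [0.5, 0.3, 0.2]     # distribution of G1 (assumed for illustration)
q = [0.6, 0.4]          # distribution of G2 (assumed for illustration)

# Joint distribution of the product experiment G = G1 x G2 (independent experiments)
joint = [pi * qj for pi in p for qj in q]

print(entropy(joint))            # H(G)
print(entropy(p) + entropy(q))   # H(G1) + H(G2): the same number
```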
  4.

    If the experiments G 1 and G 2 are dependent, then the experiment G can be represented as

    $$G= \left ( \begin{array}{c} E_1 A_1 , E_1 A_2 ,\ldots , E_N A_M \\q_{11} , q_{12} , \ldots , q_{NM} \end{array} \right ) $$

    with q ij =p i p ij , where p ij is the conditional probability of the event A j given E i , so that

    $$\begin{aligned} \sum_{j=1}^M q_{ij} &= p_i = \mathbf{P}(E_i) ,\quad i = 1, \ldots, N ;\\ \sum_{i=1}^N q_{ij} &= q_j = \mathbf{P}(A_j) ,\quad j = 1, \ldots, M . \end{aligned}$$

In this case the equality \(J_{G} = J_{G_{1}} + J_{G_{2}}\), generally speaking, does not hold. Introduce a random variable \(J_{2}^{*}\) which is equal to −logp ij on the set E i A j . Then evidently \(J_{G} = J_{G_{1}} + J_{2}^{*}\). Since

$$\mathbf{P}(A_j | E_i) = p_{ij} , $$

the quantity \(J_{2}^{*}\) for a fixed i can be considered as the information from the experiment G 2 given the event E i occurred. We will call the quantity

$$\mathbf{E}\bigl(J_2^* | E_i\bigr) = - \sum _{j=1}^M p_{ij} \log p_{ij} $$

the conditional entropy H(G 2|E i ) of the experiment G 2 given E i , and the quantity

$$\mathbf{E}J_2^* = - \sum_{i,j} q_{ij} \log p_{ij} = \sum_i p_i H(G_2 | E_i) $$

the conditional entropy H(G 2|G 1) of the experiment G 2 given G 1. In this notation, we obviously have

$$H(G) = H(G_1) + H(G_2 | G_1) . $$

We will prove that in this equality we always have

$$H(G_2 | G_1 )\le H(G_2 ), $$

i.e. for two experiments G 1 and G 2 the entropy H(G) never exceeds the sum of the entropies H(G 1) and H(G 2):

$$H(G) = H(G_1 \times G_2) \le H(G_1) + H(G_2) . $$

Equality takes place here only when q ij =p i q j , i.e. when G 1 and G 2 are independent.

Proof

First note that, for any two distributions (u 1,…,u n ) and (v 1,…,v n ), one has the inequality

$$ - \sum_i u_i \log u_i \le- \sum_i u_i \log v_i , $$
(14.1.2)

equality being possible here only if v i =u i , i=1,…,n. This follows from the concavity of the function logx, since it implies that, for any a i >0,

$$\sum_i u_i \log a_i \le\log \biggl(\sum_i u_i a_i \biggr) , $$

equality being possible only if a 1=a 2=⋯=a n . Putting a i =v i /u i , we obtain relation (14.1.2).

Next we have

$$H(G_1) + H(G_2) = - \sum_{i,j} q_{ij} (\log p_i + \log q_j) = - \sum _{i,j} q_{ij} \log p_i q_j , $$

and because {p i q j } is obviously a distribution, by virtue of (14.1.2)

$$- \sum q_{ij} \log p_i q_j \ge- \sum q_{ij} \log q_{ij} = H(G) $$

holds, and equality is possible here only if q ij =p i q j . □
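Property 4 can be illustrated on any joint distribution q ij =p i p ij : the identity H(G)=H(G 1)+H(G 2|G 1) holds exactly, and H(G 2|G 1)≤H(G 2). A sketch in Python (the joint distribution below is an arbitrary assumption):

```python
from math import log2

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

# q[i][j] = P(E_i A_j): an arbitrary joint distribution of two dependent experiments
q = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]

p = [sum(row) for row in q]                                 # P(E_i)
r = [sum(q[i][j] for i in range(3)) for j in range(2)]      # P(A_j)

H_G  = entropy([x for row in q for x in row])
H_G1 = entropy(p)
H_G2 = entropy(r)
# H(G2 | G1) = -sum_{i,j} q_ij log p_ij, where p_ij = q_ij / p_i
H_G2_given_G1 = -sum(q[i][j] * log2(q[i][j] / p[i])
                     for i in range(3) for j in range(2) if q[i][j] > 0)

print(abs(H_G - (H_G1 + H_G2_given_G1)) < 1e-9)   # True: H(G) = H(G1) + H(G2|G1)
print(H_G2_given_G1 <= H_G2)                      # True: conditioning cannot increase entropy
```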

  5.

    As we saw when considering property 3, the information obtained as a result of the experiment \(G_{1}^{n}\) consisting of n independent repetitions of the experiment G 1 is equal to

    $$J_{G_1^n} = - \sum_{j=1}^N \nu_j \log p_j , $$

    where ν j is the number of occurrences of the outcome E j . By the law of large numbers, \({\nu_{j}}/{n} \stackrel{p}{\to}p_{j}\) as n→∞, and hence

    $$\frac{1}{n} J_{G_1^n} \stackrel{p}{\to}H(G_1) = H_p . $$
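Property 5 can be seen in simulation: for n independent repetitions, the information per trial settles down to the entropy. A sketch with an assumed three-outcome distribution:

```python
import random
from math import log2

p = [0.5, 0.25, 0.25]                       # outcome probabilities of G1 (assumed)
H = -sum(x * log2(x) for x in p)            # entropy, here 1.5 bits

n = 100_000
sample = random.choices(range(3), weights=p, k=n)
info_per_trial = -sum(log2(p[j]) for j in sample) / n
print(H, info_per_trial)                    # the two numbers are close for large n
```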

To conclude this section, we note that the measure of the amount of information resulting from an experiment we considered here can be derived as the only possible one (up to a constant multiplier) if one starts with a few simple requirements that are natural to impose on such a quantity.

It is also interesting to note the connections between the above-introduced notions and large deviation probabilities. As one can see from Theorems 5.1.2 and 5.2.4, the difference between the “biased” entropy \(- \sum p_{j}^{*} \ln p_{j}\) and the entropy \(- \sum p_{j}^{*} \ln p_{j}^{*}\) (\(p_{j}^{*} = {\nu_{j}}/n\) are the relative frequencies of the outcomes E j ) is an analogue of the deviation function (see Sect. 8.8) in the multi-dimensional case.

14.2 The Entropy of a Finite Markov Chain. A Theorem on the Asymptotic Behaviour of the Information Contained in a Long Message; Its Applications

14.2.1 The Entropy of a Sequence of Trials Forming a Stationary Markov Chain

Let \(\{ X_{k} \}_{k=1}^{\infty}\) be a stationary finite Markov chain with one class of essential states without subclasses, E 1,…,E N being its states. Stationarity of the chain means that the probabilities P(X 1=j)=π j coincide with the stationary probabilities. It is clear that

$$\mathbf{P}(X_2 = j) = \sum_k \pi_k p_{kj} = \pi_j ,\qquad\mathbf {P}(X_3 = j) = \pi _j , \quad\mbox{and so on.} $$

Let G k be an experiment determining the value of X k (i.e. the state the system entered on the k-th step). If X k−1=i, then the entropy of the k-th step equals

$$H(G_k | X_{k-1} = i) = - \sum _j p_{ij } \log p_{ij} . $$

By definition, the entropy of a stationary Markov chain is equal to

$$H = \mathbf{E}H(G_k | X_{k-1}) = H(G_k | G_{k-1}) = - \sum_i \pi_i \sum_j p_{ij } \log p_{ij} . $$

Consider the first n steps X 1,…,X n of the Markov chain. By the Markov property, the entropy of this composite experiment G (n)=G 1×⋯×G n is equal to

$$\begin{aligned} H\bigl(G^{(n)}\bigr) = & H(G_1) + H(G_2 | G_1) + \cdots+ H(G_n | G_{n-1}) \\= & - \sum\pi_j \log\pi_j + (n-1) H \sim n H \end{aligned}$$

as n→∞. If the X k were independent then, as we saw, we would have the exact equality H(G (n))=nH here.
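A sketch computing the stationary distribution and the entropy H=−∑ i π i ∑ j p ij log p ij for a small chain (the two-state transition matrix below is an arbitrary assumption):

```python
from math import log2

# Transition matrix of a two-state chain (an arbitrary example)
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution: iterate pi <- pi P until it stabilizes
pi = [0.5, 0.5]
for _ in range(1000):
    pi = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

H = -sum(pi[i] * P[i][j] * log2(P[i][j])
         for i in range(2) for j in range(2) if P[i][j] > 0)
print(pi, H)     # stationary probabilities and the entropy per step
```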

14.2.2 The Law of Large Numbers for the Amount of Information Contained in a Message

Now consider a finite sequence (X 1,…,X n ) as a message (event) C n and denote, as before, by I(C n )=−logP(C n ) the amount of information contained in C n . The value of I(C n ) is a function on the space of elementary outcomes equal to the information \(J_{G^{(n)}}\) contained in the experiment G (n). We now show that, with probability close to 1, this information behaves asymptotically as nH, as was the case for independent X k . Therefore H is essentially the average information per trial in the sequence \(\{ X_{k} \}_{k=1}^{\infty}\).

Theorem 14.2.1

As n→∞,

$$\frac{I(C_n)}{n} = \frac{- \log\mathbf{P}(C_n)}{n} \stackrel {\mathit{a}.\mathit{s}.}{ \longrightarrow}H . $$

This means that, for any δ>0, the set of all messages C n can be decomposed into two classes. For the first class, |I(C n )/n−H|<δ, and the sum of the probabilities of the elements of the second class tends to 0 as n→∞.

Proof

Construct from the given Markov chain a new one \(\{ Y_{k} \}_{k=1}^{\infty}\) by setting Y k :=(X k ,X k+1). The states of the new chain are pairs of states (E i ,E j ) of the chain {X k } with p ij >0. The transition probabilities are obviously given by

$$p_{(i,j)(k,l)} = \begin{cases} 0, & j \ne k ,\\ p_{kl} , & j = k . \end{cases} $$

Note that one can easily prove by induction that

$$ p_{(i,j)(k,l)}(n) = p_{jk}(n-1) p_{kl} . $$
(14.2.1)

From the definition of {Y k } it follows that the ergodic theorem holds for this chain. This can also be seen directly from (14.2.1), the stationary probabilities being

$$\lim_{n \to\infty} p_{(i,j)(k,l)}(n) = \pi_k p_{kl} . $$

Now we will need the law of large numbers for the number of visits m (k,l)(n) of the chain \(\{ Y_{k} \}_{k=1}^{\infty}\) to state (k,l) over time n. By virtue of this law (see Theorem 13.4.4),

$$\frac{m_{(k,l)}(n)}{n} \stackrel{\mathit{a}.\mathit {s}.}{\longrightarrow} \pi_k p_{kl} \quad\mbox{as} \ n \to\infty. $$

Consider the random variable P(C n ):

$$\begin{aligned} \mathbf{P}(C_n) = \mathbf{P}(E_{X_1}E_{X_2} \cdots E_{X_n}) = & \mathbf{P}(E_{X_1}) \mathbf{P}(E_{X_2} | E_{X_1}) \cdots\mathbf{P}(E_{X_n} | E_{X_{n-1}}) \\= & \pi_{X_1} p_{X_1 X_2} \cdots p_{X_{n-1} X_n} = \pi_{X_1} \prod_{(k,l)} p_{kl}^{m_{(k,l)}(n-1)} . \end{aligned}$$

The product here is taken over all pairs (k,l). Therefore (with π i =P(X 1=i))

$$- \frac{1}{n} \log\mathbf{P}(C_n) = - \frac{1}{n} \log\pi_{X_1} - \sum_{(k,l)} \frac{m_{(k,l)}(n-1)}{n} \log p_{kl} \stackrel{\mathit{a}.\mathit{s}.}{\longrightarrow} - \sum_{(k,l)} \pi_k p_{kl} \log p_{kl} = H . $$

 □
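Theorem 14.2.1 can be observed in simulation: along a long trajectory of the chain, −log P(C n )/n approaches H. A sketch reusing the two-state chain from the previous sketch (again an arbitrary assumption):

```python
import random
from math import log2

P  = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.8, 0.2]                                   # stationary distribution of P
H  = -sum(pi[i] * P[i][j] * log2(P[i][j]) for i in range(2) for j in range(2))

n = 200_000
x = random.choices(range(2), weights=pi)[0]       # X_1 ~ pi (stationary start)
log_prob = log2(pi[x])
for _ in range(n - 1):
    y = random.choices(range(2), weights=P[x])[0]
    log_prob += log2(P[x][y])
    x = y

print(H, -log_prob / n)                           # I(C_n)/n is close to H
```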

14.2.3 The Asymptotic Behaviour of the Number of the Most Common Outcomes in a Sequence of Trials

Theorem 14.2.1 has an important corollary. Rank all the messages (words) C n of length n according to the values of their probabilities in descending order. Next pick the most probable words one by one until the sum of their probabilities exceeds a prescribed level α, 0<α<1. Denote the number (and also the set) of the selected words by M α (n).

Theorem 14.2.2

For each 0<α<1, there exists one and the same limit

$$\lim_{n \to\infty} \frac{\log M_{\alpha}(n)}{n} = H . $$

Proof

Let δ>0 be a number which can be arbitrarily small. We will say that C n falls into category K 1 if its probability \(\mathbf{P}(C_n) > 2^{- n (H - \delta)}\), and into category K 2 if

$$2^{- n (H + \delta)} < \mathbf{P}(C_n) \le2^{- n (H - \delta)}. $$

Finally, C n belongs to the third category K 3 if

$$\mathbf{P}(C_n) \le2^{- n (H + \delta)} . $$

Since, by Theorem 14.2.1, \(\mathbf{P}(C_n \in K_1 \cup K_3) \to 0\) as n→∞, for all sufficiently large n the set M α (n) contains only words from K 1 and K 2, and the last word in M α (n) (i.e. the one having the smallest probability), which we denote by C α,n , belongs to K 2. This means that

$$M_{\alpha}(n) 2^{- n (H + \delta)} < \sum_{C_n \in M_{\alpha}(n)} \mathbf{P}(C_n) < \alpha+ \mathbf{P}(C_{\alpha, n}) < \alpha+ 2^{- n (H - \delta)} . $$

This implies

$$\frac{\log M_{\alpha}(n)}{n} < \frac{\log(\alpha+ 2^{- n (H - \delta )})}{n} + H + \delta. $$

Since δ is arbitrary, we have

$$\limsup_{n \to\infty} \frac{\log M_{\alpha}(n)}{n} \le H . $$

On the other hand, the words from K 2 belonging to M α (n) have total probability ≥αP(K 1). If \(M_{\alpha}^{(2)}(n)\) is the number of these messages then

$$M_{\alpha}^{(2)}(n) 2^{- n (H - \delta)} \ge\alpha- \mathbf {P}(K_1) , $$

and, consequently,

$$M_{\alpha}(n) 2^{- n (H - \delta)} \ge\alpha- \mathbf{P}(K_1) . $$

Since P(K 1)→0 as n→∞, for sufficiently large n one has

$$\frac{\log M_{\alpha}(n)}{n} \ge H - \delta+ \frac{1}{n} \log\frac {\alpha }{2} . $$

Since δ is arbitrary, it follows that

$$\liminf_{n \to\infty} \frac{\log M_{\alpha}(n)}{n} \ge H . $$

The theorem is proved. □
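For small n, Theorem 14.2.2 can be checked by brute force: enumerate all words C n , sort them by probability, take the most probable ones until level α, and compare log M α (n)/n with H. A sketch with the same assumed two-state chain; the convergence is slow, so the two numbers agree only roughly.

```python
from itertools import product
from math import log2

P, pi = [[0.9, 0.1], [0.4, 0.6]], [0.8, 0.2]
H = -sum(pi[i] * P[i][j] * log2(P[i][j]) for i in range(2) for j in range(2))

def word_prob(w):
    pr = pi[w[0]]
    for a, b in zip(w, w[1:]):
        pr *= P[a][b]
    return pr

alpha, n = 0.5, 18
probs = sorted((word_prob(w) for w in product(range(2), repeat=n)), reverse=True)
total, M = 0.0, 0
for pr in probs:
    total += pr
    M += 1
    if total > alpha:
        break
print(H, log2(M) / n)     # log M_alpha(n) / n slowly approaches H as n grows
```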

Now one can obtain a useful interpretation of this theorem. Let N be the number of the chain states. Suppose for simplicity's sake that \(N = 2^m\). Then the number of different words of length n (chains C n ) will be equal to \(N^n = 2^{nm}\). Suppose, further, that these words are transmitted using a binary code, so that m binary symbols are used to encode each state. Thus, with such a transmission method (we will call it direct coding) the length of a message will be equal to nm. (For example, one can use Markov chains to model the Russian language and take N=32, m=5.) The assertion of Theorem 14.2.2 means that, for large n, with probability 1−ε, ε>0, only \(2^{nH}\) of the totality of \(2^{nm}\) words will be transmitted. The probability of having to transmit any of the remaining words will be small if ε is small. From this it is easy to establish the existence of another, more economical code requiring, with a large probability, a smaller number of digits to transmit a word. Indeed, one can enumerate the selected \(2^{nH}\) most likely words using, say, a binary code again, and then transmit only the number of the word. This clearly requires only nH digits. Since we always have H≤logN=m, the length of the message will be m/H≥1 times smaller.

This is a special case of the so-called basic coding theorem for Markov chains: for large n, there exists a code for which, with a high probability, the original message C n can be transmitted by a sequence of signals which is m/H times shorter than in the case of the direct coding.
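This special case is easy to mock up for a small chain: list the most probable words up to level α and index them with a fixed-length binary code, then compare the code length with the nm digits of direct coding. The chain, α and n below are assumptions chosen only to keep the enumeration small; this is a toy illustration, not an efficient coding scheme.

```python
from itertools import product
from math import log2, ceil

P, pi = [[0.9, 0.1], [0.4, 0.6]], [0.8, 0.2]
H = -sum(pi[i] * P[i][j] * log2(P[i][j]) for i in range(2) for j in range(2))
m, n, alpha = 1, 16, 0.99                      # N = 2 states, so m = 1

def word_prob(w):
    pr = pi[w[0]]
    for a, b in zip(w, w[1:]):
        pr *= P[a][b]
    return pr

words = sorted(product(range(2), repeat=n), key=word_prob, reverse=True)
typical, total = [], 0.0
for w in words:
    typical.append(w)
    total += word_prob(w)
    if total > alpha:
        break

bits = ceil(log2(len(typical)))                # bits needed to index the selected words
code = {w: format(k, "b").zfill(bits) for k, w in enumerate(typical)}
print(n * m, bits)                             # direct code length vs. new code length
print(m / H)                                   # the asymptotic compression factor m/H
```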

The above coding method is an oversimplified example rather than a recipe for efficient message compression. It should be noted that finding a really efficient coding method is a rather difficult task. For example, in Morse code it is reasonable to encode more frequent letters by shorter sequences of dots and dashes. However, a text reduction by a factor of m/H would not be achieved in this way. Certain compression techniques have been used in this book as well. For example, we replaced the frequently encountered words “characteristic function” by “ch.f.” We could achieve better results if, say, shorthand were used. The structure of a code with a high compression coefficient will certainly be very complicated. The theorems of the present chapter give an upper bound for the results we can achieve.

For a sequence of independent equiprobable symbols one has \(H = - \sum_{j=1}^{N} \frac{1}{N} \log\frac{1}{N} = \log N = m\), so such a text is incompressible. This is why the proximity of “new” messages (encoded using a new alphabet) to a sequence of equiprobable symbols could serve as a criterion for constructing new codes.

It should be taken into account, however, that the text “redundancy” we are “fighting” with is in many cases a useful and helpful phenomenon. Without such redundancy, it would be impossible to detect misprints or reconstruct omissions as easily as we, say, restore the letter “r” in the word “info⋅mation”.

The reader might know how difficult it is to read a highly abridged and formalised mathematical text. When working with an ideal code, no errors would be admissible (even if we could detect them), since it is impossible to reconstruct an omitted or distorted symbol in a sequence of equiprobable digits. In this connection there arises one of the basic problems of information theory: to find a code with the smallest “redundancy” that still allows one to eliminate transmission noise.