
Entropy is a notion taken from Thermodynamics, where it describes the uncertainty in the motion of gas particles. In this chapter the entropy will be regarded as a measure of uncertainty of a random variable.

Maximum entropy distributions, subject to certain moment constraints, will play a central role in this chapter. They are the distributions with a maximal degree of ignorance toward the unknown elements of the distribution. For instance, if nothing is known about a distribution defined on the interval [a, b], it makes sense to express our ignorance by choosing the distribution to be the uniform one. Sometimes the mean is known. In this case the maximum entropy decreases and the maximizing distribution is no longer uniform. More precisely, among all distributions p(x) defined on (0, ∞) with a given mean μ, the one with the maximum entropy is the exponential distribution. Furthermore, if both the mean and the standard deviation are given for a distribution p(x) defined on \(\mathbb{R}\), then the distribution with the largest entropy is the normal distribution.

Since the concept of entropy can be applied to any point of a statistical model, the entropy becomes a function defined on the statistical model. Then, as in Thermodynamics, we shall investigate the entropy maxima, since they play a distinguished role in the theory.

1 Introduction to Information Entropy

The notion of entropy comes originally from Thermodynamics. It is a quantity that describes the amount of disorder or randomness in a system bearing energy or information. In Thermodynamics the entropy is defined in terms of heat and temperature.

According to the second law of Thermodynamics, during any process the change in the entropy of a system and its surroundings is either zero or positive. The entropy of a free system tends to increase in time, towards a finite or infinite maximum. Some physicists define the arrow of time as the direction in which entropy increases, see Hawking [43]. Most processes tend to increase their entropy in the long run. For instance, a house falls apart, an apple rots, a person ages, a car rusts over time, etc.

Another application of entropy is in information theory, formulated by C. E. Shannon [73] in 1948 to address problems of information and communication. In this theory a distinguished role is played by the information source, which produces a sequence of messages to be communicated to the receiver. The information is a measure of the freedom of choice with which a message can be selected from the set of all possible messages. Information can be measured numerically using the logarithm in base 2, in which case the resulting units are called binary digits, or bits. One bit measures a choice between two equally likely alternatives. For instance, if a coin is tossed but we are unable to see it as it lands, the landing outcome carries 1 bit of information. If there are N equally likely choices, the number of bits equals the binary logarithm of the number of choices, log2 N. The case when the choices are not equally probable is described in the following.

Shannon defined a quantity that measures how much information is produced by an information source, and at which rate. Suppose there are n possible elementary outcomes of the source, \(A_{1},\ldots,A_{n}\), which occur with probabilities \(p_{1} = p(A_{1}),\ldots,p_{n} = p(A_{n})\), so the source outcomes are described by the discrete probability distribution

$$\displaystyle{\begin{array}{|c|cccc|}\hline \mbox{ event} &A_{1} & A_{2} & \ldots & A_{n} \\\hline \mbox{ probability}& p_{1} & p_{2} & \ldots & p_{n} \\\hline \end{array} }$$

with the \(p_{i}\) given. Assume there is an uncertainty function, \(H(p_{1},\ldots,p_{n})\), which “measures” how much “choice” is involved in selecting an event. It is natural to require that H satisfy the following properties (Shannon’s axioms):

  1. (i)

    H is continuous in each p i ;

  2. (ii)

    If \(p_{1} =\ldots = p_{n} = \frac{1} {n}\), then H is a monotonically increasing function of n (i.e., for equally likely events there is more uncertainty when there are more possible events).

  3. (iii)

    If a choice is broken down into two successive choices, then the initial H is the weighted sum of the individual values of H:

    $$\displaystyle\begin{array}{rcl} & & H(p_{1},p_{2},\ldots,p_{n-1},p^{\prime}_{n},p^{\prime \prime}_{n}) = H(p_{1},p_{2},\ldots,p_{n-1},p_{n}) {}\\ & & \quad + p_{n}H\Big(\frac{p^{\prime}_{n}} {p_{n}}, \frac{p^{\prime \prime}_{n}} {p_{n}}\Big), {}\\ \end{array}$$

    with \(p_{n} = p^{\prime}_{n} + p^{\prime \prime}_{n}\).

Shannon proved that the only function H satisfying the previous three assumptions is of the form

$$\displaystyle{H = -k\sum _{i=1}^{n}p_{ i}\log _{2}p_{i},}$$

where k is a positive constant, which amounts to the choice of a unit of measure. The negative sign in front of the sum ensures that H is non-negative. This is the definition of the information entropy for discrete systems given by Shannon [73]. It is remarkable that the same expression appears in certain formulations of statistical mechanics.

Since the next sections involve integration and differentiation, it is more convenient to use the natural logarithm instead of the binary logarithm. The entropy defined by \(H = -\sum _{i=1}^{n}p_{i}\ln p_{i}\) is measured in natural units (nats) instead of bits. Sometimes this is also denoted by \(H(p_{1},\ldots,p_{n})\).
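
To make the two units concrete, here is a minimal Python sketch (assuming NumPy is available; the helper name `entropy` is ours) that evaluates the discrete entropy either in bits (base 2) or in nats (base e):

```python
import numpy as np

def entropy(p, base=np.e):
    """Shannon entropy -sum p_i log p_i of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5], base=2))                   # 1.0 bit for a fair coin
print(entropy([0.5, 0.5]))                           # ln 2 = 0.6931... nats
print(entropy([0.5, 0.25, 0.125, 0.125], base=2))    # 1.75 bits
```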

We make some more remarks regarding notation. We write H(X) to denote the entropy of a random variable X, H(p) to denote the entropy of a probability density p, and H(ξ) to denote the entropy H(p ξ ) on a statistical model with parameter ξ. The joint entropy of two random variables X and Y will be denoted by H(X, Y ), while H(X | Y ) will be used for the conditional entropy of X given Y. These notations will be used interchangeably, depending on the context.

The entropy can be used to measure information in the following way. The information can be measured as a reduction in the uncertainty, i.e. entropy. If X and Y are random variables that describe an event, the initial uncertainty about the event is H(X). After the random variable Y is revealed, the new uncertainty is H(X | Y ). The reduction in uncertainty, H(X) − H(X | Y ), is called the information conveyed about X by Y. Its symmetry property is left as an exercise in Problem 3.3, part (d).

In the case of a discrete random variable X, the entropy can be interpreted as the weighted average of the numbers \(-\ln p_{i}\), where the weights are the probabilities of the values of the random variable X. Equivalently, it is the expectation of the random variable that assumes the value \(-\ln p_{i}\) with probability \(p_{i}\):

$$\displaystyle{H(X) = -\sum _{i=1}^{n}P(X = x_{ i})\ln P(X = x_{i}) = E[-\ln P(X)].}$$

Extending the situation from the discrete case, the uncertainty of a continuous random variable X defined on the interval (a, b) will be defined by an integral. If p denotes the probability density function of X, then the integral

$$\displaystyle{H(X) = -\int _{a}^{b}p(x)\ln p(x)\,dx}$$

defines the entropy of X, provided the integral is finite.

This chapter considers the entropy on statistical models as a function of their parameters. It provides examples of statistical manifolds and their associated entropies and deals with the main properties of the entropy regarding bounds, maximization, and its relation with the Fisher information metric.

2 Definition and Examples

Let \(\mathcal{S} =\{ p_{\xi } = p(x;\xi );\xi = (\xi ^{1},\ldots,\xi ^{n}) \in \mathbb{E}\}\) be a statistical model, where \(p(\cdot;\xi ): \mathcal{X} \rightarrow [0,\infty )\) is the probability density function, which depends on the parameter vector ξ. The entropy on the manifold \(\mathcal{S}\) is the function \(H: \mathbb{E} \rightarrow \mathbb{R}\) equal to the negative of the expectation of the log-likelihood function, \(H(\xi ) = -E_{p_{\xi }}[\ell_{x}(\xi )]\). More precisely,

$$\displaystyle{H(\xi ) = \left \{\begin{array}{ll} -\int _{\mathcal{X}}p(x,\xi )\ln p(x,\xi )\,dx,&\mbox{ if $\mathcal{X}$ is continuous;}\\ \\ -\sum _{x\in \mathcal{X}}p(x,\xi )\ln p(x,\xi ), &\mbox{ if $\mathcal{X}$ is discrete.}\\ \end{array} \right.}$$

Since the entropy is associated with each distribution p(x, ξ), we shall also use the alternative notation \(H\big(p(x,\xi )\big)\). Sometimes, the entropy in the continuous case is called differential entropy, while in the discrete case it is called discrete entropy.

It is worth noting that in the discrete case the entropy is always non-negative, while in the continuous case it might be zero or negative. Since a simple rescaling of the random variable can turn a continuous distribution with positive entropy into one with negative entropy (see Problem 3.4), in the continuous case there is no canonical entropy, but only a relative one. In order to address this drawback, the entropy is modified into the relative information entropy, as we shall see in Chap. 4.

The entropy can be defined in terms of a base measure on the space \(\mathcal{X}\), but to keep the exposition elementary we shall assume that \(\mathcal{X} \subseteq \mathbb{R}^{n}\) is endowed with the Lebesgue measure dx.

The entropy for a few standard distributions is computed in the next examples.

Example 3.2.1 (Normal Distribution)

In this case \(\mathcal{X} = \mathbb{R}\), \(\xi = (\mu,\sigma ) \in \mathbb{R} \times (0,\infty )\) and

$$\displaystyle{p(x;\xi ) = \frac{1} {\sigma \sqrt{2\pi }}\,e^{-\frac{(x-\mu )^{2}} {2\sigma ^{2}} }.}$$

The entropy is

$$\displaystyle\begin{array}{rcl} H(\mu,\sigma )& =& -\int _{\mathcal{X}}p(x)\ln p(x)\,dx {}\\ & =& -\int _{\mathcal{X}}p(x)\Big(-\frac{1} {2}\ln (2\pi ) -\ln \sigma -\frac{(x-\mu )^{2}} {2\sigma ^{2}} \Big)\,dx {}\\ & =& \frac{1} {2}\ln (2\pi ) +\ln \sigma + \frac{1} {2\sigma ^{2}}\int _{\mathcal{X}}(x-\mu )^{2}p\,dx {}\\ & =& \frac{1} {2}\ln (2\pi ) +\ln \sigma + \frac{1} {2\sigma ^{2}} \cdot \sigma ^{2} {}\\ & =& \frac{1} {2}\ln (2\pi ) +\ln \sigma +\frac{1} {2} {}\\ & =& \ln (\sigma \sqrt{2\pi e}). {}\\ \end{array}$$

It follows that the entropy does not depend on μ and increases logarithmically as a function of σ, with \(\lim _{\sigma \searrow 0}H = -\infty \), \(\lim _{\sigma \nearrow \infty }H = \infty \). Furthermore, the only changes of coordinates \(\varphi: \mathbb{E} \rightarrow \mathbb{E}\) under which the entropy is invariant, i.e., \(H(\xi ) = H\big(\varphi (\xi )\big)\), are the translations \(\varphi (\mu,\sigma ) = (\mu +k,\sigma )\), \(k \in \mathbb{R}\).
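
As a numerical sanity check of the closed form \(H(\mu,\sigma ) =\ln (\sigma \sqrt{2\pi e})\), the following Python sketch (assuming SciPy is available; the values of μ and σ are arbitrary) compares it with a direct quadrature of \(-\int p\ln p\):

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.3, 0.7
logp = lambda x: -(x - mu)**2 / (2 * sigma**2) - np.log(sigma * np.sqrt(2 * np.pi))

# integrate -p ln p over a range carrying essentially all of the mass
H_numeric, _ = quad(lambda x: -np.exp(logp(x)) * logp(x), mu - 40 * sigma, mu + 40 * sigma)
H_closed = np.log(sigma * np.sqrt(2 * np.pi * np.e))
print(H_numeric, H_closed)   # both ~ 1.0623, independent of mu
```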

Example 3.2.2 (Poisson Distribution)

In this case the sample space is \(\mathcal{X} = \mathbb{N}\), and the probability density

$$\displaystyle{p(n;\xi ) = e^{-\xi }\frac{\xi ^{n}} {n!},\qquad n \in \mathbb{N},\;\xi > 0}$$

depends only on one parameter, ξ. Using \(\ln p(n,\xi ) = -\xi + n\ln \xi -\ln (n!)\), we have

$$\displaystyle\begin{array}{rcl} H(\xi )& =& -\sum _{n\geq 0}p(n,\xi )\ln p(n,\xi ) {}\\ & =& -\sum _{n\geq 0}\Big(-\xi e^{-\xi }\frac{\xi ^{n}} {n!} + n\ln \xi e^{-\xi }\frac{\xi ^{n}} {n!} -\ln (n!)e^{-\xi }\frac{\xi ^{n}} {n!}\Big) {}\\ & =& \xi e^{-\xi }\mathop{\underbrace{\sum _{ n\geq 0} \frac{\xi ^{n}} {n!}}}\limits _{=e^{\xi }} -\ln \xi \, e^{-\xi }\sum _{n\geq 0}\frac{n\xi ^{n}} {n!} + e^{-\xi }\sum _{ n\geq 0}\frac{\xi ^{n}\ln (n!)} {n!} {}\\ & =& \xi -\ln \xi \,e^{-\xi }\xi e^{\xi } + e^{-\xi }\sum _{ n\geq 0}\frac{\ln (n!)} {n!} \xi ^{n} {}\\ & =& \xi (1-\ln \xi ) + e^{-\xi }\sum _{ n\geq 0}\frac{\ln (n!)} {n!} \xi ^{n}. {}\\ \end{array}$$

We note that \(\lim _{\xi \searrow 0}H(\xi )=0\) and \(H(\xi ) < \infty\), since the series \(\sum _{n\geq 0}\frac{\xi ^{n}\ln (n!)} {n!}\) has an infinite radius of convergence, see Problem 3.21.
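
The closed-form expression above can be compared with the direct sum \(-\sum _{n}p(n,\xi )\ln p(n,\xi )\); a short sketch assuming SciPy (gammaln(n+1) computes ln n!), with an illustrative value of ξ:

```python
import numpy as np
from scipy.special import gammaln

xi = 2.5
n = np.arange(400)                             # truncation; the neglected tail is negligible
logp = n * np.log(xi) - xi - gammaln(n + 1)    # ln p(n) = -xi + n ln(xi) - ln(n!)
p = np.exp(logp)

H_direct = -np.sum(p * logp)
H_series = xi * (1 - np.log(xi)) + np.sum(p * gammaln(n + 1))
print(H_direct, H_series)                      # both ~ 1.8308
```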

Example 3.2.3 (Exponential Distribution)

Consider the exponential distribution

$$\displaystyle{p(x;\xi ) =\xi e^{-\xi x},\qquad x > 0,\;\xi > 0}$$

with parameter ξ. The entropy is

$$\displaystyle\begin{array}{rcl} H(\xi )& =& -\int _{0}^{\infty }p(x)\ln p(x)\,dx = -\int _{ 0}^{\infty }\xi e^{-\xi x}(\ln \xi -\xi x)\,dx {}\\ & =& -\xi \ln \xi \int _{0}^{\infty }e^{-\xi x}\,dx +\xi \int _{ 0}^{\infty }\xi e^{-\xi x}\,x\,dx {}\\ & =& -\ln \xi \mathop{\underbrace{\int _{0}^{\infty }p(x,\xi )\,dx}}\limits _{=1} +\xi \mathop{\underbrace{ \int _{0}^{\infty }xp(x,\xi )\,dx}}\limits _{=1/\xi } {}\\ & =& 1-\ln \xi, {}\\ \end{array}$$

which is a decreasing function of ξ, with H(ξ) > 0 for ξ ∈ (0, e). Making the parameter change \(\lambda = \frac{1} {\xi }\), the model becomes \(p(x;\lambda ) = \frac{1} {\lambda } e^{-x/\lambda }\), λ > 0. The entropy H(λ) = 1 + lnλ increases logarithmically in λ. We note that the expression of the entropy depends on the chosen parametrization.

Example 3.2.4 (Gamma Distribution)

Consider the family of distributions

$$\displaystyle{p_{{\xi }}(x) = p_{{\alpha,\beta }}(x) = \frac{1} {\beta ^{\alpha }\varGamma (\alpha )}\,\,x^{\alpha -1}e^{-x/\beta },}$$

with positive parameters \((\xi ^{1},\xi ^{2}) = (\alpha,\beta )\) and x > 0. We shall start by showing that

$$\displaystyle{ \int _{0}^{\infty }\ln x\,\,p_{{\alpha,\beta }}(x)\,dx =\ln \beta +\psi (\alpha ), }$$
(3.2.1)

where

$$\displaystyle{ \psi (\alpha ) = \frac{\varGamma ^{\prime}(\alpha )} {\varGamma (\alpha )} }$$
(3.2.2)

is the digamma function. Using that the integral of \(p_{{\alpha,\beta }}(x)\) is unity, we have

$$\displaystyle{\int _{0}^{\infty }x^{\alpha -1}\,e^{-\frac{x} {\beta } }\,dx =\beta ^{\alpha }\,\varGamma (\alpha ),}$$

and differentiating with respect to α, it follows

$$\displaystyle{ \int _{0}^{\infty }\ln x\,x^{\alpha -1}\,e^{-\frac{x} {\beta } }\,dx =\ln \beta \,\beta ^{\alpha }\,\varGamma (\alpha ) +\beta ^{\alpha }\,\varGamma ^{\prime}(\alpha ). }$$
(3.2.3)

Dividing by \(\beta ^{\alpha }\varGamma (\alpha )\) yields relation (3.2.1).

Since

$$\displaystyle{\ln p_{{\alpha,\beta }}(x) = -\alpha \,\ln \beta -\ln \varGamma (\alpha ) + (\alpha -1)\ln x -\frac{x} {\beta },}$$

using \(\int _{0}^{\infty }p_{{\alpha,\beta }}(x)\,dx = 1\), \(\int _{0}^{\infty }x\,p_{{\alpha,\beta }}(x)\,dx =\alpha \beta\) and (3.2.1), the entropy becomes

$$\displaystyle\begin{array}{rcl} H(\alpha,\beta )& =& -\int _{0}^{\infty }p_{{\alpha,\beta }}(x)\ln p_{{\alpha,\beta }}(x)\,dx {}\\ & =& \alpha \,\ln \beta +\ln \varGamma (\alpha ) - (\alpha -1)\,\int _{0}^{\infty }\ln x\,p_{{\alpha,\beta }}(x)\,dx {}\\ & & +\frac{1} {\beta } \,\int _{0}^{\infty }x\,p_{{\alpha,\beta }}(x)\,dx {}\\ & =& \ln \beta +(1-\alpha )\psi (\alpha ) +\ln \varGamma (\alpha ) +\alpha. {}\\ \end{array}$$
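
The formula \(H(\alpha,\beta ) =\ln \beta +(1-\alpha )\psi (\alpha ) +\ln \varGamma (\alpha ) +\alpha\) can be verified numerically; a sketch assuming SciPy (digamma and gammaln play the roles of ψ and lnΓ), with illustrative parameter values:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import digamma, gammaln

alpha, beta = 2.3, 1.7
logp = lambda x: -alpha * np.log(beta) - gammaln(alpha) + (alpha - 1) * np.log(x) - x / beta
integrand = lambda x: 0.0 if x == 0 else -np.exp(logp(x)) * logp(x)

H_numeric, _ = quad(integrand, 0, np.inf)
H_closed = np.log(beta) + (1 - alpha) * digamma(alpha) + gammaln(alpha) + alpha
print(H_numeric, H_closed)   # both ~ 2.20
```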

Example 3.2.5 (Beta Distribution)

The beta distribution on \(\mathcal{X} = [0,1]\) is defined by the density

$$\displaystyle{p_{a,b}(x) = \frac{1} {B(a,b)}\,\,x^{a-1}(1 - x)^{b-1},}$$

with a, b > 0 and beta function given by

$$\displaystyle{ B(a,b) =\int _{ 0}^{1}x^{a-1}(1 - x)^{b-1}\,dx. }$$
(3.2.4)

Differentiating with respect to a and b in (3.2.4) yields

$$\displaystyle\begin{array}{rcl} \partial _{a}B(a,b)& =& \int _{0}^{1}\ln x\,x^{a-1}(1 - x)^{b-1}\,dx {}\\ \partial _{b}B(a,b)& =& \int _{0}^{1}\ln (1 - x)\,x^{a-1}(1 - x)^{b-1}\,dx. {}\\ \end{array}$$

Using

$$\displaystyle{\ln p_{a,b} = -\ln B(a,b) + (a - 1)\ln x + (b - 1)\ln (1 - x),}$$

we find

$$\displaystyle\begin{array}{rcl} H(a,b)& =& -\int _{0}^{1}p_{ a,b}(x)\,\ln p_{a,b}(x)\,dx \\ & =& \ln B(a,b) - \frac{a - 1} {B(a,b)}\int _{0}^{1}\ln x\,x^{a-1}(1 - x)^{b-1}\,dx \\ & & - \frac{b - 1} {B(a,b)}\int _{0}^{1}\ln (1 - x)\,x^{a-1}(1 - x)^{b-1}\,dx \\ & =& \ln B(a,b) - (a - 1)\frac{\partial _{a}B(a,b)} {B(a,b)} - (b - 1)\frac{\partial _{b}B(a,b)} {B(a,b)} \\ & =& \ln B(a,b) - (a - 1)\partial _{a}\ln B(a,b) - (b - 1)\partial _{b}\ln B(a,b).{}\end{array}$$
(3.2.5)

We shall express the entropy in terms of digamma function (3.2.2). Using the expression of the beta function in terms of gamma functions

$$\displaystyle{B(a,b) = \frac{\varGamma (a)\varGamma (b)} {\varGamma (a + b)},}$$

we have

$$\displaystyle{\ln B(a,b) =\ln \varGamma (a) +\ln \varGamma (b) -\ln \varGamma (a + b).}$$

The partial derivatives of the function B(a, b) are

$$\displaystyle\begin{array}{rcl} \partial _{a}\ln B(a,b) =\psi (a) -\psi (a + b)& &{}\end{array}$$
(3.2.6)
$$\displaystyle\begin{array}{rcl} \partial _{b}\ln B(a,b) =\psi (b) -\psi (a + b).& &{}\end{array}$$
(3.2.7)

Substituting in (3.2.5) yields

$$\displaystyle{ H(a,b) =\ln B(a,b) + (a + b - 2)\psi (a + b) - (a - 1)\psi (a) - (b - 1)\psi (b). }$$
(3.2.8)

For example

$$\displaystyle\begin{array}{rcl} H(1/2,1/2)& =& \ln \sqrt{\pi } +\ln \sqrt{\pi } -\psi (1) +\psi (1/2) {}\\ & =& \ln \pi +\gamma -2\ln 2-\gamma =\ln \frac{\pi } {4} < 0, {}\\ \end{array}$$

where we used

$$\displaystyle{\psi (1) = -\gamma = -0.5772\ldots,\qquad \psi (1/2) = -2\ln 2 -\gamma.}$$

It can be shown that the entropy is always non-positive, see Problem 3.22. For a = b = 1 the entropy vanishes

$$\displaystyle{H(1,1) =\ln \varGamma (1) +\ln \varGamma (1) -\ln \varGamma (2) = 0.}$$
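
Formula (3.2.8) is easy to evaluate numerically; in particular it reproduces \(H(1/2,1/2) =\ln \frac{\pi }{4}\) and H(1, 1) = 0. A sketch assuming SciPy (betaln = ln B, digamma = ψ):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import betaln, digamma

def beta_entropy(a, b):
    """Entropy of the beta distribution via formula (3.2.8)."""
    return (betaln(a, b) + (a + b - 2) * digamma(a + b)
            - (a - 1) * digamma(a) - (b - 1) * digamma(b))

def beta_entropy_numeric(a, b):
    logp = lambda x: (a - 1) * np.log(x) + (b - 1) * np.log(1 - x) - betaln(a, b)
    f = lambda x: 0.0 if x <= 0 or x >= 1 else -np.exp(logp(x)) * logp(x)
    return quad(f, 0, 1)[0]

print(beta_entropy(0.5, 0.5), np.log(np.pi / 4))       # both ~ -0.2416
print(beta_entropy(1, 1))                              # 0.0
print(beta_entropy(2, 3), beta_entropy_numeric(2, 3))  # the two values agree
```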

Example 3.2.6 (Lognormal Distribution)

The lognormal distribution

$$\displaystyle{p_{\mu,\sigma }(x) = \frac{1} {\sqrt{2\pi }\sigma x}e^{-\frac{(\ln x-\mu )^{2}} {2\sigma ^{2}} },\,\,(\mu,\sigma ) \in (0,\infty ) \times (0,\infty )}$$

defines a statistical model on the sample space \(\mathcal{X} = (0,\infty )\). First, using the substitution \(y =\ln x-\mu\), we have

$$\displaystyle\begin{array}{rcl} \int _{0}^{\infty }\ln x\,p_{\mu,\sigma }(x)\,dx& =& \int _{0}^{\infty }(\ln x-\mu )\,p_{\mu,\sigma }(x)\,dx +\mu {}\\ & =& \int _{-\infty }^{+\infty } \frac{y} {\sqrt{2\pi }\sigma }e^{-\frac{y^{2}} {2\sigma ^{2}} }\,dy+\mu =\mu. {}\\ \int _{0}^{\infty }(\ln x-\mu )^{2}\,p_{\mu,\sigma }(x)\,dx& =& \int _{-\infty }^{+\infty } \frac{1} {\sqrt{2\pi }\sigma }e^{-\frac{y^{2}} {2\sigma ^{2}} }y^{2}\,dy =\sigma ^{2}. {}\\ \end{array}$$

Using

$$\displaystyle{\ln p_{\mu,\sigma } = -\ln (\sqrt{2\pi }\sigma ) -\ln x - (\ln x-\mu )^{2} \frac{1} {2\sigma ^{2}},}$$

and the previous integrals, the entropy becomes

$$\displaystyle\begin{array}{rcl} H(\mu,\sigma )& =& -\int _{0}^{\infty }p_{\mu,\sigma }(x)\,\ln p_{\mu,\sigma }(x)\,dx {}\\ & =& \ln (\sqrt{2\pi }\sigma ) +\int _{ 0}^{\infty }\ln x\,p_{\mu,\sigma }(x)\,dx {}\\ & & + \frac{1} {2\sigma ^{2}}\int _{0}^{\infty }(\ln x-\mu )^{2}p_{\mu,\sigma }(x)\,dx {}\\ & =& \ln (\sqrt{2\pi }) +\ln \sigma +\mu + \frac{1} {2}. {}\\ \end{array}$$
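
A quadrature check of \(H(\mu,\sigma ) =\ln (\sqrt{2\pi }\sigma ) +\mu +\frac{1} {2}\); again a sketch assuming SciPy, with illustrative parameter values:

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 0.4, 0.8
logp = lambda x: -np.log(np.sqrt(2 * np.pi) * sigma * x) - (np.log(x) - mu)**2 / (2 * sigma**2)
f = lambda x: 0.0 if x <= 0 else -np.exp(logp(x)) * logp(x)

H_numeric, _ = quad(f, 0, np.inf)
H_closed = np.log(np.sqrt(2 * np.pi) * sigma) + mu + 0.5
print(H_numeric, H_closed)   # both ~ 1.5958
```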

Example 3.2.7 (Dirac Distribution)

A Dirac distribution on (a, b) centered at \(x_{0} \in (a,b)\) represents the density of an idealized point mass at \(x_{0}\). It can be thought of as an infinitely high, infinitely thin spike at \(x_{0}\), with total area under the spike equal to 1. The Dirac distribution centered at \(x_{0}\) is customarily denoted by \(p(x) =\delta (x - x_{0})\), and its relation with the integral can be written informally as

  1. (i)

    \(\int _{a}^{b}p(x)\,dx =\int _{ a}^{b}\delta (x - x_{ 0})\,dx = 1;\)

  2. (ii)

    \(\int _{a}^{b}g(x)p(x)\,dx =\int _{ a}^{b}g(x)\delta (x - x_{ 0})\,dx = g(x_{0})\),

for any continuous function g(x) on (a, b).

The k-th moment is given by

$$\displaystyle{m_{k} =\int _{ a}^{b}x^{k}\delta (x - x_{ 0})\,dx = x_{0}^{k}.}$$

Then the mean of the Dirac distribution is \(\mu = x_{0}\) and the variance is \(Var = m_{2} - (m_{1})^{2} = 0\). The underlying random variable, which is Dirac distributed, is a constant equal to \(x_{0}\).

In order to compute the entropy of \(\delta (x - x_{0})\), we shall approximate the distribution by a family of distributions \(\varphi _{\epsilon }(x)\) for which we can easily compute the entropy. For any ε > 0, consider the distribution

$$\displaystyle{\varphi _{\epsilon }(x) = \left \{\begin{array}{ll} \frac{1} {\epsilon },&\mbox{ if}\;\vert x - x_{0}\vert <\epsilon /2\\ \\ 0, &\mbox{ otherwise,} \end{array} \right.}$$

with the entropy given by

$$\displaystyle\begin{array}{rcl} H_{\epsilon }& =& -\int _{a}^{b}\varphi _{ \epsilon }(x)\ln \varphi _{\epsilon }(x)\,dx {}\\ & =& -\int _{x_{0}-\epsilon /2}^{x_{0}+\epsilon /2}\frac{1} {\epsilon } \ln \frac{1} {\epsilon } \,dx {}\\ & =& \ln \epsilon. {}\\ \end{array}$$

Since \(\lim _{\epsilon \searrow 0}\varphi _{\epsilon } =\delta (x - x_{0})\), by the Dominated Convergence Theorem the entropy of \(\delta (x - x_{0})\) is given by the limit

$$\displaystyle{H =\lim _{\epsilon \searrow 0}H_{\epsilon } =\lim _{\epsilon \searrow 0}\ln \epsilon = -\infty.}$$

In conclusion, the Dirac distribution has the lowest possible entropy. Heuristically, this is due to the complete lack of randomness of the associated random variable, which is a constant.

3 Entropy on Products of Statistical Models

Consider the statistical manifolds \(\mathcal{S}\) and \(\mathcal{U}\) and let \(\mathcal{S}\times \mathcal{U}\) be their product model, see Example 1.3.9. Any density function \(f \in \mathcal{S}\times \mathcal{U}\), with f(x, y) = p(x)q(y), \(p \in \mathcal{S}\), \(q \in \mathcal{U}\), has the entropy

$$\displaystyle\begin{array}{rcl} H_{\mathcal{S}\times \mathcal{U}}(f)& =& -\int \!\!\!\int _{\mathcal{X}\times \mathcal{Y}}f(x,y)\ln f(x,y)\,dxdy {}\\ & =& -\int \!\!\!\int _{\mathcal{X}\times \mathcal{Y}}p(x)q(y)[\ln p(x) +\ln q(y)]\,dxdy {}\\ & =& -\int _{\mathcal{Y}}q(y)\,dy\,\int _{\mathcal{X}}p(x)\ln p(x)\,dx {}\\ & & -\int _{\mathcal{X}}p(x)\,dx\,\int _{\mathcal{Y}}q(y)\ln q(y)\,dy {}\\ & =& H_{\mathcal{S}}(p) + H_{\mathcal{U}}(q), {}\\ \end{array}$$

i.e., the entropy of an element of the product model \(\mathcal{S}\times \mathcal{U}\) is the sum of the entropies of its projections on \(\mathcal{S}\) and \(\mathcal{U}\). This can also be stated by saying that the joint entropy of two independent random variables X and Y is the sum of the individual entropies, i.e.

$$\displaystyle{H(X,Y ) = H(X) + H(Y ),}$$

see Problem 3.5 for details.
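
In the discrete case the additivity for independent random variables can be seen directly, since the joint distribution is the outer product of the marginals; a minimal NumPy sketch with arbitrary marginals:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

px = np.array([0.2, 0.5, 0.3])        # distribution of X
qy = np.array([0.6, 0.4])             # distribution of Y
joint = np.outer(px, qy)              # independence: f(x, y) = p(x) q(y)

print(H(joint), H(px) + H(qy))        # the two values coincide
```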

4 Concavity of Entropy

Theorem 3.4.1

For any two densities \(p,q: \mathcal{X} \rightarrow \mathbb{R}\) we have

$$\displaystyle{ H(\alpha p +\beta q) \geq \alpha H(p) +\beta H(q), }$$
(3.4.9)

∀α,β ∈ [0,1], with α + β = 1.

Proof:

Using that f(u) = −ulnu is concave on \((0,\infty )\), we obtain

$$\displaystyle{f(\alpha p +\beta q) \geq \alpha f(p) +\beta f(q).}$$

Integrating (summing) over \(\mathcal{X}\) leads to expression (3.4.9). ■ 
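
A small numerical illustration of inequality (3.4.9) for two discrete densities (a NumPy sketch; the densities and the weight α are arbitrary):

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
alpha = 0.4

print(H(alpha * p + (1 - alpha) * q))        # ~ 1.08
print(alpha * H(p) + (1 - alpha) * H(q))     # ~ 0.86, smaller
```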

With a similar proof we can obtain the following result.

Corollary 3.4.2

For any densities \(p_{1},\ldots,p_{n}\) on \(\mathcal{X}\) and λ i ∈ [0,1] with \(\lambda _{1} +\ldots +\lambda _{n} = 1\) , we have

$$\displaystyle{H\Big(\sum _{i=1}^{n}\lambda _{ i}p_{i}\Big) \geq \sum _{i=1}^{n}\lambda _{ i}H(p_{i}).}$$

The previous result suggests looking for the maxima of the entropy function on a statistical model.

5 Maxima for Entropy

Let \(\mathcal{S} =\{ p_{\xi }(x);x \in \mathcal{X},\xi \in \mathbb{E}\}\) be a statistical model. We can regard the entropy H as a function defined on the parameter space \(\mathbb{E}\). We are interested in the values of the parameter ξ for which the entropy H(ξ) has a local maximum. Such a parameter value corresponds to a distinguished density \(p_{\xi }\). Sometimes the density \(p_{\xi }\) satisfies some given constraints, provided by the available observations, and has a maximum degree of ignorance with respect to the unknown ones. This type of problem is solved by maximizing the entropy subject to constraints. In order to study this problem we shall start with the definition and characterization of the critical points of the entropy.

Let f be a function defined on the statistical manifold S = { p ξ }. If \(\partial _{i} = \partial _{\xi ^{i}}\) denotes the tangent vector field on S in the direction of ξ i, then

$$\displaystyle{\partial _{i}f =: \partial _{\xi ^{i}}f:= \partial _{\xi ^{i}}(f \circ p_{{\xi }}).}$$

In the following the role of the function f is played by the entropy H(ξ) = H(p ξ ).

Definition 3.5.1

A point q ∈ S is a critical point for the entropy H if

$$\displaystyle{X(H) = 0,\quad \forall X \in T_{q}S.}$$

Since \(\{\partial _{i}\}_{i}\) form a basis, choosing \(X = \partial _{i}\), we obtain that the point \(q = p_{\xi } \in S\) is a critical point for H if and only if

$$\displaystyle{\partial _{i}H(\xi ) = 0,\qquad i = 1,2,\ldots,n.}$$

A computation provides

$$\displaystyle\begin{array}{rcl} \partial _{i}H& =& -\partial _{i}\int _{\mathcal{X}}p(x,\xi )\ln p(x,\xi )\,dx {}\\ & =& -\int _{\mathcal{X}}\Big(\partial _{i}p(x,\xi )\,\ln p(x,\xi ) + p(x,\xi )\frac{\partial _{i}p(x,\xi )} {p(x,\xi )} \Big)\,dx {}\\ & =& -\int _{\mathcal{X}}\big(\ln p(x,\xi ) + 1\big)\partial _{i}p(x,\xi )\,dx {}\\ & =& -\int _{\mathcal{X}}\ln p(x,\xi )\,\partial _{i}p(x,\xi )\,dx, {}\\ \end{array}$$

where we used that

$$\displaystyle{\int _{\mathcal{X}}p(x,\xi )\,dx = 1}$$

and

$$\displaystyle{0 = \partial _{i}\int _{\mathcal{X}}p(x,\xi )\,dx =\int _{\mathcal{X}}\partial _{i}p(x,\xi )\,dx.}$$

The previous computation can be summarized as in the following.

Proposition 3.5.2

The probability distribution p ξ is a critical point of the entropy H if and only if

$$\displaystyle{ \int _{\mathcal{X}}\ln p(x,\xi )\,\partial _{\xi ^{i}}p(x,\xi )\,dx = 0,\quad \forall i = 1,\ldots,m. }$$
(3.5.10)

In the discrete case, when \(\mathcal{X} =\{ x^{1},\ldots,x^{n}\}\) , the Eq. (3.5.10) is replaced by the relation

$$\displaystyle{ \sum _{k=1}^{n}\ln p(x^{k},\xi )\,\partial _{ i}p(x^{k},\xi ) = 0,\quad \forall i = 1,\ldots,m. }$$
(3.5.11)

Observe that the critical points characterized by the previous result do not belong to the boundary. The entropy, being a concave function, may attain local minima along the boundary of a convex set (such as a mixture family). Even if these points are called critical by some authors, here we do not consider them as part of our analysis.

The first derivative of the entropy can also be expressed in terms of the log-likelihood function as follows:

$$\displaystyle\begin{array}{rcl} \partial _{i}H& =& -\int _{\mathcal{X}}\ln p(x,\xi )\,\partial _{\xi ^{i}}p(x,\xi )\,dx \\ & =& -\int _{\mathcal{X}}p(x,\xi )\ln p(x,\xi )\,\partial _{i}\ln p(x,\xi )\,dx \\ & =& -\int _{\mathcal{X}}p(x,\xi )\ell(\xi )\,\partial _{i}\ell(\xi )\,dx \\ & =& -E_{\xi }[\ell(\xi )\,\partial _{\xi ^{i}}\ell(\xi )]. {}\end{array}$$
(3.5.12)

The goal of this section is to characterize the distributions p ξ for which the entropy is maximum. Minima and maxima are among the set of critical points, see Definition 3.5.1. In order to deal with this issue we need to compute the Hessian of the entropy H.

The second order partial derivatives of the entropy H are

$$\displaystyle\begin{array}{rcl} \partial _{ji}H& =& -\partial _{j}\int _{\mathcal{X}}\ln p(x,\xi )\,\partial _{i}p(x,\xi )\,dx {}\\ & =& -\int _{\mathcal{X}}\Big(\frac{\partial _{j}p(x)} {p(x)} \partial _{i}p(x) +\ln p(x)\;\partial _{i}\partial _{j}p(x)\Big)\,dx {}\\ & =& -\int _{\mathcal{X}}\Big( \frac{1} {p(x)}\partial _{i}p(x)\;\partial _{j}p(x) +\ln p(x)\;\partial _{ji}p(x)\Big)\,dx. {}\\ \end{array}$$

In the discrete case this becomes

$$\displaystyle{ \partial _{ji}H = -\sum _{k=1}^{n}\Big(\frac{\partial _{i}p(x^{k},\xi )\;\partial _{ j}p(x^{k},\xi )} {p(x^{k},\xi )} +\ln p(x_{k},\xi )\;\partial _{ij}p(x^{k},\xi )\Big). }$$
(3.5.13)

We can also express the Hessian of the entropy in terms of the log-likelihood function only. Differentiating in (3.5.12) we have

$$\displaystyle\begin{array}{rcl} \partial _{ji}H& =& -\partial _{j}\int _{\mathcal{X}}p(x,\xi )\ell(\xi )\,\partial _{i}\ell(\xi )\,dx {}\\ & =& -\int _{\mathcal{X}}\Big(\partial _{j}p(x,\xi )\ell(\xi )\,\partial _{i}\ell(\xi ) + p(x,\xi )\partial _{j}\ell(\xi )\,\partial _{i}\ell(\xi ) {}\\ & & +p(x,\xi )\ell(\xi )\,\partial _{i}\partial _{j}\ell(\xi )\Big)\,dx {}\\ & =& -E_{\xi }[\partial _{i}\ell\,\partial _{j}\ell] - E_{\xi }[(\partial _{j}\ell(\xi )\partial _{i}\ell(\xi ) + \partial _{i}\partial _{j}\ell(\xi ))\ell(\xi )] {}\\ & =& -g_{ij}(\xi ) - h_{ij}(\xi ). {}\\ \end{array}$$

We arrived at the following result that relates the entropy and the Fisher information.

Proposition 3.5.3

The Hessian of the entropy is given by

$$\displaystyle{ \partial _{i}\partial _{j}H(\xi ) = -g_{ij}(\xi ) - h_{ij}(\xi ), }$$
(3.5.14)

where g ij (ξ) is the Fisher–Riemann metric and

$$\displaystyle{h_{ij}(\xi ) = E_{\xi }[(\partial _{j}\ell(\xi )\partial _{i}\ell(\xi ) + \partial _{i}\partial _{j}\ell(\xi ))\ell(\xi )].}$$

Corollary 3.5.4

In the case of the mixture family ( 1.5.15 )

$$\displaystyle{ p(x;\xi ) = C(x) +\xi ^{i}F_{ i}(x) }$$
(3.5.15)

the Fisher–Riemann metric is given by

$$\displaystyle{ g_{ij}(\xi ) = -\partial _{i}\partial _{j}H(\xi ). }$$
(3.5.16)

Furthermore, any critical point of the entropy (see Definition 3.5.1) is a maximum point.

Proof:

From Proposition 1.5.1, part (iii) we have \(\partial _{i}\partial _{j}\ell_{x}(\xi )\) \(= -\partial _{i}\ell_{x}(\xi )\,\partial _{j}\ell_{x}(\xi )\) which implies h ij (ξ) = 0. Substituting in (3.5.14) yields (3.5.16). Using that the Fisher–Riemann matrix g ij (ξ) is positive definite at any ξ, it follows that \(\partial _{i}\partial _{j}H(\xi )\) is globally negative definite, and hence all critical points must be maxima. We also note that we can express the Hessian in terms of F j as in the following

$$\displaystyle{\partial _{i}\partial _{j}H(\xi ) = -\int _{\mathcal{X}}\frac{F_{i}(x)F_{j}(x)} {p(x;\xi )} \,dx.}$$

 ■ 

A Hessian \(Hess(F) = (\partial _{ij}F)\) is called positive definite if and only if \(\sum _{i,j}\partial _{ij}F\,v^{i}v^{j} > 0\) for all \(v\neq 0\), or, equivalently,

$$\displaystyle{\langle Hess(F)v,v\rangle > 0,\qquad \forall v \in \mathbb{R}^{m}\setminus \{0\}.}$$

In the following we shall deal with the relationship between the Hessian and the second variation of the entropy H.

Consider a curve ξ(s) in the parameter space and let \(\big(\xi _{u}(s)\big)_{\vert u\vert <\epsilon }\) be a smooth variation of the curve with \(\xi _{u}(s)_{\vert u=0} =\xi (s)\). Then \(s \rightarrow p_{\xi _{u}(s)}\) is a variation of the curve \(s \rightarrow p_{{\xi (s)}}\) on the statistical manifold S. Consider the variation

$$\displaystyle{\xi _{u}(s) =\xi (s) + u\eta (s),}$$

so \(\partial _{u}\xi _{u}(s) =\eta (s)\) and \(\partial _{u}^{2}\xi _{u}(s) = 0\). The second variation of the entropy along the curve \(s \rightarrow p_{\xi _{u}(s)}\) is

$$\displaystyle\begin{array}{rcl} \frac{d^{2}} {du^{2}}H\big(\xi _{u}(s)\big)& =& \frac{d} {du}\langle \partial _{\xi }H,\partial _{u}\xi _{u}(s)\rangle {}\\ & =& \langle \frac{d} {du}\partial _{\xi }H,\partial _{u}\xi (s)\rangle +\langle \partial _{\xi }H,\mathop{\underbrace{\partial _{u}^{2}\xi _{u}(s)}}\limits _{=0}\rangle {}\\ & =& \frac{d} {du}(\partial _{i}H)\,\partial _{u}\xi ^{i}(s) {}\\ & =& \partial _{i}\partial _{j}H(\xi _{u}(s)) \cdot \partial _{u}\xi _{u}^{i}(s)\partial _{ u}\xi _{u}^{j}(s). {}\\ \end{array}$$

Taking u = 0, we find

$$\displaystyle\begin{array}{rcl} \frac{d^{2}} {du^{2}}H\big(\xi _{u}(s)\big)_{\mid u=0}& =& \partial _{ij}H\big(\xi (s)\big)\eta ^{i}(s)\eta ^{j}(s) {}\\ & =& \langle Hess\,H\big(\xi (s)\big)\eta,\eta \rangle. {}\\ \end{array}$$

Hence \(\frac{d^{2}} {du^{2}} H\big(\xi _{u}(s)\big)_{\mid u=0} < 0\) (respectively \(> 0\)) for all variations \(\eta \neq 0\) if and only if Hess(H) is negative (respectively positive) definite. Summarizing, we have:

Theorem 3.5.5

If ξ is such that p ξ satisfies the critical point condition (3.5.10) (or condition (3.5.11) in the discrete case), and the Hessian Hess(H(ξ)) is negative definite at ξ, then p ξ is a local maximum point for the entropy.

We shall use this result in the next section.

Corollary 3.5.6

Let ξ 0 be such that

$$\displaystyle{ E_{\xi _{0}}[\ell(\xi _{0})\partial _{i}\ell(\xi _{0})] = 0 }$$
(3.5.17)

and h ij 0 ) is positive definite. Then p(x,ξ 0 ) is a distribution for which the entropy reaches a local maximum.

Proof:

By virtue of (3.5.12), Eq. (3.5.17) is equivalent to the critical point condition \(\partial _{i}H(\xi )_{\vert \xi =\xi _{0}} = 0\). Since \(g_{ij}(\xi _{0})\) is positive definite and, by hypothesis, so is \(h_{ij}(\xi _{0})\), relation (3.5.14) implies that \(\partial _{i}\partial _{j}H(\xi _{0})\) is negative definite. Applying Theorem 3.5.5 ends the proof. ■ 

6 Weighted Coin

Generally, for discrete distributions we may identify the statistical space \(\mathcal{S}\) with the parameter space \(\mathbb{E}\). We shall present next a simple example where the entropy can be maximized. Flipping a weighted coin produces either heads, with probability \(\xi ^{1}\), or tails, with probability \(\xi ^{2} = 1 -\xi ^{1}\). The statistical manifold obtained this way depends on only one essential parameter \(\xi:=\xi ^{1}\). Since \(\mathcal{X} =\{ x_{1} = heads,x_{2} = tails\}\), the manifold is just a curve in \(\mathbb{R}^{2}\) parameterized by ξ ∈ [0, 1]. The probability distribution of the weighted coin is given by the table

$$\displaystyle{\begin{array}{|c|cc|}\hline \mbox{ event} & x_{1} = \mbox{ heads} & x_{2} = \mbox{ tails} \\\hline \mbox{ probability}& \xi & 1-\xi \\\hline \end{array} }$$

We shall find the points of maximum entropy. First we write the Eq. (3.5.11) to determine the critical points

$$\displaystyle\begin{array}{rcl} \ln p(x_{1},\xi )\,\partial _{\xi }p(x_{1},\xi ) +\ln p(x_{2},\xi )\,\partial _{\xi }p(x_{2},\xi )& =& 0\Longleftrightarrow {}\\ \ln \xi -\ln (1-\xi )& =& 0\Longleftrightarrow {}\\ \xi & =& 1-\xi {}\\ \end{array}$$

and hence there is only one critical point, \(\xi = \frac{1} {2}\).

The Hessian has only one component, so formula (3.5.13) yields

$$\displaystyle\begin{array}{rcl} \partial _{\xi }^{2}H& =& -\Big( \frac{1} {p(x_{1})}\big(\partial _{\xi }p(x_{1})\big)^{2} +\ln p(x_{ 1})\partial _{\xi }^{2}p(x_{ 1})\Big) {}\\ & & -\Big( \frac{1} {p(x_{2})}\big(\partial _{\xi }p(x_{2})\big)^{2} +\ln p(x_{ 2})\partial _{\xi }^{2}p(x_{ 2})\Big) {}\\ & =& -\Big(\frac{1} {\xi } \cdot 1 +\ln \xi \cdot 0\Big) {}\\ & & -\Big( \frac{1} {1-\xi }\big(\partial _{\xi }(1-\xi )\big)^{2} +\ln (1-\xi )\;\partial _{\xi }^{2}(1-\xi )\Big) {}\\ & =& -\Big(\frac{1} {\xi } + \frac{1} {1-\xi }\Big). {}\\ \end{array}$$

Evaluating at the critical point, we get

$$\displaystyle{\partial _{\xi }^{2}H_{\mid \xi =\frac{1} {2} } = -4 < 0,}$$

and hence \(\xi = \frac{1} {2}\) is a maximum point for the entropy. In this case \(\xi ^{1} =\xi ^{2} = \frac{1} {2}\). This can be restated by saying that the fair coin has the highest entropy among all weighted coins.
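
A short numerical sketch (NumPy) of the weighted-coin entropy \(H(\xi ) = -\xi \ln \xi -(1-\xi )\ln (1-\xi )\), confirming the maximum at ξ = 1∕2 and the Hessian value −4 there:

```python
import numpy as np

H   = lambda xi: -xi * np.log(xi) - (1 - xi) * np.log(1 - xi)
d2H = lambda xi: -(1 / xi + 1 / (1 - xi))

xi = np.linspace(0.01, 0.99, 99)
print(xi[np.argmax(H(xi))])    # 0.5
print(H(0.5), np.log(2))       # maximum value ln 2
print(d2H(0.5))                # -4.0
```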

7 Entropy for Finite Sample Space

Again, we underline that for discrete distributions we identify the statistical space \(\mathcal{S}\) with the parameter space \(\mathbb{E}\).

Consider a statistical model with a finite discrete sample space \(\mathcal{X} =\{ x^{1},\ldots,x^{n+1}\}\) and associated probabilities \(p(x^{i}) =\xi ^{i}\), \(\xi ^{i} \in [0,1]\), \(i = 1,\ldots,n + 1\). Since \(\xi ^{n+1} = 1 -\sum _{i=1}^{n}\xi ^{i}\), the statistical manifold is described by n essential parameters, and hence it has dimension n. The manifold can also be seen as a hypersurface in \(\mathbb{R}^{n+1}\). The entropy function is

$$\displaystyle{ H = -\sum _{i=1}^{n+1}\xi ^{i}\ln \xi ^{i}. }$$
(3.7.18)

The following result deals with the maximum entropy condition. Even if it can be derived from the concavity property of H, see Theorem 3.4.1, we prefer to deduce it here in a direct way. We note that concavity is used as a tool to derive the case of continuous distributions, see Corollary 5.9.3.

Theorem 3.7.1

The entropy (3.7.18) is maximum if and only if

$$\displaystyle{ \xi ^{1} =\ldots =\xi ^{n+1} = \frac{1} {n + 1}. }$$
(3.7.19)

Proof:

The critical point condition (3.5.11) becomes

$$\displaystyle\begin{array}{rcl} \sum _{k=1}^{n}\ln p(x^{k},\xi )\partial _{\xi ^{ i}}p(x^{k},\xi ) +\ln p(x^{n+1},\xi )\;\partial _{\xi ^{ i}}p(x^{n+1},\xi )& =& 0\quad \Longleftrightarrow {}\\ \sum _{k=1}^{n}\ln \xi ^{k}\;\delta _{ ik} +\ln \xi ^{n+1}\;\partial _{\xi ^{ i}}(1 -\xi ^{1} -\ldots -\xi ^{n})& =& 0\Longleftrightarrow {}\\ \ln \xi ^{i} -\ln \xi ^{n+1}& =& 0\Longleftrightarrow {}\\ \xi ^{i}& =& \xi ^{n+1},\qquad {}\\ \end{array}$$

\(\forall i = 1,\ldots,n\). Hence condition (3.7.19) follows.

We shall investigate the Hessian at this critical point. Following formula (3.5.13) yields

$$\displaystyle\begin{array}{rcl} Hess(H)_{ij}& =& -\sum _{k=1}^{n}\frac{\partial _{i}(\xi ^{k}) \cdot \partial _{ j}(\xi ^{k})} {\xi ^{k}} -\frac{\partial _{i}(\xi ^{n+1}) \cdot \partial _{j}(\xi ^{n+1})} {\xi ^{n+1}} {}\\ & & -\sum _{k=1}^{n}\ln \xi ^{k}\;\partial _{ i}\partial _{j}(\xi ^{k}) -\ln \xi ^{n+1}\;\partial _{ i}\partial _{j}(\xi ^{n+1}) {}\\ & =& -\Big(\sum _{k=1}^{n}\frac{\delta _{ik}\delta _{jk}} {\xi ^{k}} + \frac{1} {\xi ^{n+1}}\Big), {}\\ \end{array}$$

where we have used that each \(\xi ^{k}\) is affine in the parameters, so \(\partial _{i}\partial _{j}(\xi ^{k}) = 0\), and \(\partial _{i}(\xi ^{n+1}) = \partial _{i}(1 -\xi ^{1} -\ldots -\xi ^{n}) = -1\), for \(i = 1,\ldots,n\).

At the critical point the Hessian is equal to

$$\displaystyle{{Hess(H)_{ij}}_{\vert _{\xi ^{k}= \frac{1} {n+1} }} = -(n + 1)\Big(1 +\sum _{ k=1}^{n}\delta _{ ik}\delta _{jk}\Big) = -(n + 1)\big(\mathbf{1}\,\mathbf{1}^{T} + I_{ n}\big)_{ij},}$$

where \(\mathbf{1} = (1,\ldots,1)^{T} \in \mathbb{R}^{n}\). Since the matrix \(\mathbf{1}\,\mathbf{1}^{T} + I_{n}\) has eigenvalues n + 1 and 1, the Hessian is negative definite. Theorem 3.5.5 leads to the desired conclusion. ■ 
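
The negative definiteness at the critical point can also be checked numerically from the eigenvalues of \(-(n+1)(\mathbf{1}\,\mathbf{1}^{T} + I_{n})\); a NumPy sketch with an illustrative n:

```python
import numpy as np

n = 5
Hess = -(n + 1) * (np.ones((n, n)) + np.eye(n))   # Hessian at xi^k = 1/(n+1)
print(np.linalg.eigvalsh(Hess))   # -(n+1)^2 = -36 once and -(n+1) = -6 with multiplicity n-1
```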

Example 3.7.2

Let ξ i be the probability that a die lands with the face i up. This model depends on five essential parameters. According to the previous result, the fair die is the one which maximizes the entropy.

8 A Continuous Distribution Example

Let \(p(x;\xi ) = 2\xi x + 3(1-\xi )x^{2}\) be a continuous probability distribution function, with x ∈ [0, 1]. The statistical manifold defined by the above probability distribution is one dimensional, since \(\xi \in \mathbb{R}\). There is only one basic vector field equal to

$$\displaystyle{\partial _{\xi } = 2x - 3x^{2},}$$

and which does not depend on ξ. In order to find the critical points, we follow Eq. (3.5.10)

$$\displaystyle\begin{array}{rcl} \int _{0}^{1}p(x,\xi )\,\partial _{\xi }p(x,\xi )\,dx& =& 0\Longleftrightarrow {}\\ \int _{0}^{1}(2x - 3x^{2})(2\xi x + 3(1-\xi )x^{2})\,dx& =& 0\Longleftrightarrow {}\\ \frac{2} {15}\xi - \frac{3} {10}& =& 0\Longleftrightarrow\xi = \frac{9} {4}. {}\\ \end{array}$$

Before investigating the Hessian, we note that

$$\displaystyle{\partial _{\xi }p(x;\xi ) = 2x - 3x^{2},\quad \partial _{\xi }^{2}p(x;\xi ) = 0,\quad p\Big(x; \frac{9} {4}\Big) = \frac{9} {2}x -\frac{15} {4} x^{2},\quad }$$

so

$$\displaystyle\begin{array}{rcl} \partial _{\xi }^{2}H_{\mid \xi =\frac{9} {4} }& =& -\int _{0}^{1}\Big(\frac{1} {p}(\partial _{\xi }p)^{2} +\ln p\;\partial _{\xi }^{2}p\Big)\,dx_{\Big \vert \xi =\frac{9} {4} } {}\\ & =& -\int _{0}^{1}\frac{(2x - 3x^{2})^{2}} {\frac{9} {2}x -\frac{15} {4} x^{2}} \,dx < 0, {}\\ \end{array}$$

because \(\frac{9} {2}x -\frac{15} {4} x^{2} > 0\) for x ∈ (0, 1].

Hence \(\xi = \frac{9} {4}\) is a maximum point for the entropy. The maximum value of the entropy is

$$\displaystyle\begin{array}{rcl} H\Big(\frac{9} {4}\Big)& =& -\int _{0}^{1}\big(\frac{9} {2}x -\frac{15} {4} x^{2}\big)\ln \big(\frac{9} {2}x -\frac{15} {4} x^{2}\big)\,dx {}\\ & =& -\frac{52} {25}\ln 3 + \frac{47} {30} + \frac{23} {25}\ln 2 {}\\ & \approx & -0.0807514876. {}\\ \end{array}$$
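
The value above can be confirmed by quadrature; a sketch assuming SciPy:

```python
import numpy as np
from scipy.integrate import quad

p = lambda x: 4.5 * x - 3.75 * x**2              # p(x; 9/4) = (9/2)x - (15/4)x^2
f = lambda x: 0.0 if p(x) <= 0 else -p(x) * np.log(p(x))

H_numeric, _ = quad(f, 0, 1)
H_closed = -52/25 * np.log(3) + 47/30 + 23/25 * np.log(2)
print(H_numeric, H_closed)    # both ~ -0.0808
```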

9 Upper Bounds for Entropy

We shall start by computing a rough upper bound for the entropy in the case when the sample space is a finite interval, \(\mathcal{X} = [a,b]\). Consider the convex function

$$\displaystyle{f: [0,\infty ) \rightarrow \mathbb{R},\qquad f(u) = \left \{\begin{array}{ccc} u\ln u&if &u \in (0,\infty )\\ \ 0 &if & u = 0. \end{array} \right.}$$

Since f′(u) = 1 + ln u, the function u ↦ u ln u attains its global minimum value − 1∕e at u = 1∕e, and hence u ln u ≥ −1∕e for all u ≥ 0, see Fig. 3.1.

Let \(p: \mathcal{X} \rightarrow \mathbb{R}\) be a probability density. Substituting u = p(x) yields p(x) lnp(x) ≥ −1∕e. Integrating, we find

$$\displaystyle{\int _{a}^{b}p(x)\ln p(x)\,dx \geq -\frac{b - a} {e}.}$$

Using the definition of the entropy we obtain the following upper bound.

Figure 3.1: The function x ↦ x ln x has a global minimum value equal to − 1∕e, reached at x = 1∕e.

Proposition 3.9.1

The entropy H(p) of a probability distribution p: [a,b] → [0,∞) satisfies the inequality

$$\displaystyle{ H(p) \leq \frac{b - a} {e}. }$$
(3.9.20)

Corollary 3.9.2

The entropy H(p) is smaller than half the length of the domain interval of the distribution p, i.e.,

$$\displaystyle{H(p) \leq \frac{b - a} {2}.}$$

This implies that the entropy H(p) is smaller than the mean of the uniform distribution on [0, b − a].

We note that the inequality (3.9.20) becomes an equality for the uniform distribution p: [0, e] → [0, ∞), p(x) = 1∕e, see Problem 3.20. We shall next present another upper bound, which is attained by all uniform distributions.

Theorem 3.9.3

The entropy of a smooth probability distribution p: [a,b] → [0,∞) satisfies the inequality

$$\displaystyle{ H(p) \leq \ln (b - a). }$$
(3.9.21)

Proof:

Since the function

$$\displaystyle{f: [0,\infty ) \rightarrow \mathbb{R},\qquad f(u) = \left \{\begin{array}{ccc} u\ln u&if &u \in (0,\infty )\\ \ 0 &if & u = 0 \end{array} \right.}$$

is convex on \([0,\infty )\), an application of Jensen's integral inequality yields

$$\displaystyle\begin{array}{rcl} f\Big( \frac{1} {b - a}\int _{a}^{b}p(x)\,dx\Big)& \leq & \frac{1} {b - a}\int _{a}^{b}f\big(p(x)\big)\,dx\Longleftrightarrow {}\\ f\Big( \frac{1} {b - a}\Big)& \leq & \frac{1} {b - a}\int _{a}^{b}p(x)\ln p(x)\,dx\Longleftrightarrow {}\\ \ln \Big( \frac{1} {b - a}\Big)& \leq & \int _{a}^{b}p(x)\ln p(x)\,dx\Longleftrightarrow {}\\ -\ln (b - a)& \leq & -H(p), {}\\ \end{array}$$

which is equivalent to (3.9.21). The identity is reached for the uniform distribution p(x) = 1∕(ba). ■ 

The above result states that, among the distributions on [a, b], the maximum entropy is attained only by the uniform distribution. In this sense, the entropy measures the closeness of a distribution to the uniform distribution.

Since we have the inequality

$$\displaystyle{\ln x \leq \frac{x} {e},\quad \quad \forall x > 0}$$

with equality only for x = e, see Fig. 3.2, it follows that the inequality (3.9.21) provides a better bound than (3.9.20).

In the following we shall present the bounds of the entropy in terms of the maxima and minima of the probability distribution. We shall use the following inequality involving the weighted average of n numbers.

Lemma 3.9.4

If \(\lambda _{1},\ldots,\lambda _{n} > 0\) and \(\alpha _{1},\ldots,\alpha _{n} \in \mathbb{R}\) , then

$$\displaystyle{\min _{j}\{\alpha _{j}\} \leq \frac{\sum _{i}\lambda _{i}\alpha _{i}} {\sum _{i}\lambda _{i}} \leq \max _{j}\{\alpha _{j}\}.}$$
Figure 3.2: The inequality ln x ≤ x∕e becomes an equality at x = e.

This says that if \(\alpha _{j}\) are the coordinates of n points of masses \(\lambda _{j}\), then the coordinate of the center of mass of the system is at least the smallest coordinate and at most the largest coordinate.

Proposition 3.9.5

Consider the discrete probability distribution p ={ p j }, with \(p_{1} \leq \ldots \leq p_{n}\) . Then the entropy satisfies the double inequality

$$\displaystyle{-\ln p_{n} \leq H(p) \leq -\ln p_{1}.}$$

Proof:

Letting λ j  = p j and α j  = −lnp j in Lemma 3.9.4 and using

$$\displaystyle{H(p) = -\sum _{j}p_{j}\ln p_{j} = \frac{\sum _{i}\lambda _{i}\alpha _{i}} {\sum _{i}\lambda _{i}},}$$

we find the desired inequality. ■ 

Remark 3.9.6

The distribution p = { p j } is uniform with \(p_{j} = \frac{1} {n}\) if and only if p 1 = p n . In this case the entropy is given by

$$\displaystyle{H(p) = -\ln p_{1} = -\ln p_{n} = -\ln \frac{1} {n} =\ln n.}$$

The continuous analog of Proposition 3.9.5 is given below.

Proposition 3.9.7

Consider the continuous probability distribution \(p: \mathcal{X} \rightarrow [a,b] \subset [0,\infty )\) , with \(p_{m} =\min _{x\in \mathcal{X}}p(x)\) and \(p_{{M}} =\max _{x\in \mathcal{X}}p(x)\) . Then the entropy satisfies the inequality

$$\displaystyle{-\ln p_{{M}} \leq H(p) \leq -\ln p_{m}.}$$

Proof:

The proof uses the following continuous analog of Lemma 3.9.4,

$$\displaystyle{\min _{x\in \mathcal{X}}\alpha (x) \leq \frac{\int _{\mathcal{X}}\lambda (x)\alpha (x)\,dx} {\int _{\mathcal{X}}\lambda (x)\,dx} \leq \max _{x\in \mathcal{X}}\alpha (x),}$$

where we choose α(x) = −lnp(x) and λ(x) = p(x). ■ 

10 Boltzmann–Gibbs Submanifolds

Let

$$\displaystyle{\mathcal{S} =\{ p_{\xi }: [0,1]\longrightarrow \mathbb{R}_{+};\;\int _{\mathcal{X}}p_{{\xi }}(x)\,dx = 1\},\quad \xi \in \mathbb{E},}$$

be a statistical model with the state space \(\mathcal{X} = [0,1]\). Let \(\mu \in \mathbb{R}\) be a fixed constant and consider the set of elements of \(\mathcal{S}\) with the mean μ

$$\displaystyle{\mathcal{M}_{\mu } =\{ p_{\xi } \in \mathcal{S};\;\int _{\mathcal{X}}xp_{{\xi }}(x)\,dx =\mu \}.}$$

and assume that \(\mathcal{M}_{\mu }\) is a submanifold of \(\mathcal{S}\).

Definition 3.10.1

The statistical submanifold \(\mathcal{M}_{\mu } =\{ p_{\xi }\}\) defined above is called a Boltzmann–Gibbs submanifold of \(\mathcal{S}\) .

Example 3.10.1

In the case of the beta distribution, the Boltzmann–Gibbs submanifold \(\mathcal{M}_{\mu } =\{ p_{a,ka};a > 0,k = (1-\mu )/\mu \}\) is just a curve. In particular, \(\mathcal{M}_{1} =\{ p_{a,0};a > 0\}\), with \(p_{a,0}(x) = \frac{1} {B(a,0)}x^{a-1}(1 - x)^{-1}\).

One of the problems arising here is to find the distribution of maximum entropy on a Boltzmann–Gibbs submanifold. Since the maxima are among the critical points, introduced in Definition 3.5.1, we shall start by finding the critical points of the entropy

$$\displaystyle{H(\xi ) = H(p_{\xi }) = -\int _{\mathcal{X}}p_{{\xi }}(x)\,\ln p_{{\xi }}(x)\,dx}$$

on a Boltzmann–Gibbs submanifold \(\mathcal{M}_{\mu }\). Differentiating with respect to \(\xi ^{j}\) in the relations

$$\displaystyle{ \int _{\mathcal{X}}xp_{{\xi }}(x)\,dx =\mu,\quad \int _{\mathcal{X}}p_{{\xi }}(x)\,dx = 1 }$$
(3.10.22)

yields

$$\displaystyle{ \int _{\mathcal{X}}x\,\partial _{j}p(x,\xi )\,dx = 0,\qquad \int _{\mathcal{X}}\partial _{j}p(x,\xi )\,dx = 0. }$$
(3.10.23)

A computation provides

$$\displaystyle\begin{array}{rcl} -\partial _{j}H(\xi )& =& \partial _{j}\int _{\mathcal{X}}p_{{\xi }}(x)\ln p_{{\xi }}(x)\,dx {}\\ & =& \int _{\mathcal{X}}\Big(\partial _{j}p_{{\xi }}(x)\;\ln p_{{\xi }}(x) + p_{{\xi }}(x)\frac{\partial _{j}p_{{\xi }}(x)} {p_{{\xi }}(x)} \Big)\,dx {}\\ & =& \int _{\mathcal{X}}\partial _{j}p(x)\;\ln p_{{\xi }}(x)\,dx +\mathop{\underbrace{ \int _{\mathcal{X}}\partial _{j}p_{{\xi }}(x)\,dx}}\limits _{=0\;by\;(\mbox{ 3.10.23})}. {}\\ \end{array}$$

Hence the critical points \(p_{{\xi }}\) satisfying j H(ξ) = 0 are solutions of the integral equation

$$\displaystyle{ \int \partial _{j}p(x,\xi )\,\ln p(x,\xi )\,dx = 0, }$$
(3.10.24)

subject to the constraint

$$\displaystyle{ \int _{\mathcal{X}}x\partial _{j}p(x,\xi )\,dx = 0. }$$
(3.10.25)

Multiplying (3.10.25) by the Lagrange multiplier λ = λ(ξ) and adding it to (3.10.24) yields

$$\displaystyle{\int _{\mathcal{X}}\partial _{j}p(x,\xi )\Big(\ln p(x,\xi ) +\lambda (\xi )x\Big)\,dx = 0.}$$

Since \(\int \partial _{j}p(x,\xi )\,dx = 0\), it makes sense to consider those critical points for which the term lnp(x, ξ) +λ(ξ)x is a constant function in x, i.e., depends only on ξ

$$\displaystyle{\ln p(x,\xi ) +\lambda (\xi )x =\theta (\xi ).}$$

Then the above equation has the solution

$$\displaystyle{ p(x,\xi ) = e^{\theta (\xi )-\lambda (\xi )x}, }$$
(3.10.26)

which is an exponential family. We still need to determine the functions θ and λ such that the constraints (3.10.22) hold. This will be done explicitly for the case when the sample space is \(\mathcal{X} = [0,1]\). From the second constraint we obtain a relation between θ and λ:

$$\displaystyle{\int _{0}^{1}p(x,\xi )\,dx=1\Longrightarrow e^{\theta (\xi )}\int _{ 0}^{1}e^{-\lambda (\xi )x}\,dx = 1\Longleftrightarrow\frac{1 - e^{-\lambda (\xi )}} {\lambda (\xi )} = e^{-\theta (\xi )},}$$

which leads to

$$\displaystyle{\theta (\xi ) =\ln \frac{\lambda (\xi )} {1 - e^{-\lambda (\xi )}}.}$$

Substituting in (3.10.26) yields

$$\displaystyle{ p(x,\xi ) = \frac{\lambda (\xi )} {1 - e^{-\lambda (\xi )}}e^{-\lambda (\xi )x}. }$$
(3.10.27)

Substituting in the constraint

$$\displaystyle{\int _{0}^{1}xp(x,\xi )\,dx =\mu,}$$

we find

$$\displaystyle\begin{array}{rcl} \frac{\lambda (\xi )} {1 - e^{-\lambda (\xi )}}\int _{0}^{1}xe^{-\lambda (\xi )x}\,dx& =& \mu \Longleftrightarrow {}\\ & & {}\\ \frac{1 -\big (1 +\lambda (\xi )\big)e^{-\lambda (\xi )}} {\lambda (\xi )(1 - e^{-\lambda (\xi )})} & =& \mu \Longleftrightarrow {}\\ & & {}\\ \frac{e^{\lambda (\xi )} -\lambda (\xi ) - 1} {\lambda (\xi )(e^{\lambda (\xi )} - 1)} & =& \mu \Longleftrightarrow {}\\ \frac{1} {\lambda (\xi )} - \frac{1} {e^{\lambda (\xi )} - 1}& =& \mu. {}\\ \end{array}$$

Given μ, we need to solve the above equation for λ(ξ). In order to complete the computation, we need the following result.

Lemma 3.10.2

The function

$$\displaystyle{f(x) = \frac{1} {x} - \frac{1} {e^{x} - 1},\,\,x \in (-\infty,0) \cup (0,\infty ),}$$

has the following properties

  1. i)

    \(\lim _{x\searrow 0}f(x) =\lim _{x\nearrow 0}f(x) = \frac{1} {2}\),

  2. ii)

    \(\lim _{x\longrightarrow \infty }f(x) = 0\), \(\lim _{x\longrightarrow -\infty }f(x) = 1\),

  3. iii)

    f(x) is a strictly decreasing function of x.

Proof:

i) Applying l’Hôpital’s rule twice, we get

$$\displaystyle\begin{array}{rcl} \lim _{x\searrow 0}f(x)& =& \lim _{x\searrow 0}\frac{e^{x} - 1 - x} {x(e^{x} - 1)} =\lim _{x\searrow 0} \frac{e^{x} - 1} {e^{x} - 1 + xe^{x}} {}\\ & =& \lim _{x\searrow 0} \frac{e^{x}} {e^{x} + xe^{x} + e^{x}} =\lim _{x\searrow 0} \frac{1} {2 + x} = \frac{1} {2}. {}\\ \end{array}$$

ii) It follows easily from the properties of the exponential function. ■ 

Since the function f is one-to-one, the equation f(λ) = μ has at most one solution, see Fig. 3.3. More precisely,

  • if μ ≥ 1, the equation has no solution;

  • if μ ∈ (0, 1), the equation has a unique solution, for any ξ, i.e., λ is constant, λ = f −1(μ). For instance, if μ = 1∕2, then λ = 0.

It follows that θ is also constant,

$$\displaystyle{\theta =\ln \frac{\lambda } {1 - e^{-\lambda }}.}$$

Hence the distribution becomes

$$\displaystyle{p(x) = e^{\theta -\lambda x},\quad x \in (0,1).}$$
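
For a prescribed mean μ ∈ (0, 1), the value of λ can be obtained numerically by solving f(λ) = μ with a standard root finder, after which the maximum entropy density on [0, 1] is completely determined. A sketch assuming SciPy; the bracket for brentq and the value of μ are illustrative:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.integrate import quad

f = lambda lam: 1 / lam - 1 / np.expm1(lam)    # f(lambda) = 1/lambda - 1/(e^lambda - 1)

mu = 0.3
# f decreases from ~1/2 to 0 on (0, infinity); for mu > 1/2 bracket negative lambda instead
lam = brentq(lambda t: f(t) - mu, 1e-8, 50)
theta = np.log(lam / (1 - np.exp(-lam)))
p = lambda x: np.exp(theta - lam * x)

print(quad(p, 0, 1)[0])                        # 1.0  (normalization)
print(quad(lambda x: x * p(x), 0, 1)[0])       # 0.3  (the prescribed mean)
```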
Figure 3.3: The graph of the decreasing function \(f(x) = \frac{1} {x} - \frac{1} {e^{x} - 1}\) and the solution of the equation f(λ) = μ with μ ∈ (0, 1).

11 Adiabatic Flows

The entropy H(ξ) is a real function defined on the parameter space \(\mathbb{E}\) of the statistical model \(\mathcal{S} =\{ p_{\xi }\}\). The critical points of H(ξ) are solutions of the system \(\partial _{i}H(\xi ) = 0\). Suppose that the set C of critical points is void. Then the constant level sets \(\sum _{c}:=\{ H(\xi ) = c\}\) are hypersurfaces in \(\mathbb{E}\). As usual, we accept the denomination of hypersurface for \(\sum _{c}\) even if it consists of only a finite number of points.

Let s ⟶ ξ(s), \(\xi (s) \in \mathbb{E}\), be a curve situated in one of the hypersurfaces \(\sum _{c}\). Since H(ξ(s)) = c, it follows

$$\displaystyle{ \frac{d} {ds}H\big(\xi (s)\big) = \partial _{j}H\big(\xi (s)\big)\,\dot{\xi }^{j}(s) = 0. }$$
(3.11.28)

Since \(\dot{\xi }^{j}(s)\) is an arbitrary vector tangent to \(\sum _{c}\), the gradient vector field \((\partial _{i}H)\) is normal to \(\sum _{c}\). Consequently, any vector field \(X = (X^{i})\) on \(\mathbb{E}\) that satisfies

$$\displaystyle{\partial _{i}H(\xi )X^{i}(\xi ) = 0}$$

is tangent to \(\sum _{c}\).

Let \(X = (X^{i})\) be a vector field tangent to \(\sum _{c}\). The flow ξ(s) defined by

$$\displaystyle{\dot{\xi }^{i}(s) = X^{i}(\xi (s)),\,\,i = 1,\ldots,n = \mbox{ dim\, S}}$$

is called an adiabatic flow on \(\sum _{c}\). Along such a flow H(ξ(s)) = c, i.e., the entropy is unchanged; equivalently, \(H(\xi )\) is a first integral and \(\sum _{c}\) is an invariant set of the flow.

Suppose now that S = { p ξ } refers to a continuous distribution statistical model. Then

$$\displaystyle\begin{array}{rcl} \partial _{j}H\big(\xi (s)\big)& =& -\int _{\mathcal{X}}\ln p\big(x,\xi (s)\big)\;\partial _{j}p\big(x,\xi (s)\big)\,dx {}\\ & =& -\int _{\mathcal{X}}p\big(x,\xi (s)\big)\,\ell_{x}(\xi (s))\,\partial _{j}\ell_{x}(\xi (s))\,dx, {}\\ \end{array}$$

and combining with (3.11.28) we arrive at the following result:

Proposition 3.11.1

The flow \(\dot{\xi }^{i}(s) = X^{i}(\xi (s))\) is adiabatic if and only if

$$\displaystyle{\int _{\mathcal{X}}p\big(x,\xi (s)\big)\,\ell_{x}(\xi (s)) \frac{d} {ds}\ell_{x}(\xi (s))\,dx = 0.}$$

Example 3.11.1

If in the case of the normal distribution the entropy along the curve \(s\longrightarrow p_{\sigma (s),\mu (s)}\) is constant, i.e.,

$$\displaystyle{H\big(\sigma (s),\mu (s)\big) =\ln \big (\sigma (s)\sqrt{2\pi e}\big) = c}$$

then \(\sigma (s) = \frac{e^{c}} {\sqrt{2\pi e}}\), constant. Hence the adiabatic flow in this case corresponds to the straight lines

$$\displaystyle{\{\sigma = constant,\;\mu (s)\},}$$

with μ(s) an arbitrary curve.

For more information regarding flows the reader is referred to Udriste [80, 82, 83].

12 Problems

  1. 3.1.

    Use the uncertainty function axioms to show the following relations:

    1. (a)

      \(H\Big(\frac{1} {2}, \frac{1} {3}, \frac{1} {6}\Big) = H\Big(\frac{1} {2}, \frac{1} {2}\Big) + \frac{1} {2}H\Big(\frac{2} {3}, \frac{1} {3}\Big)\).

    2. (b)

      \(H\Big(\frac{1} {2}, \frac{1} {4}, \frac{1} {8}, \frac{1} {8}\Big) = H\Big(\frac{3} {4}, \frac{1} {4}\Big) + \frac{3} {4}H\Big(\frac{2} {3}, \frac{1} {3}\Big) + \frac{1} {4}H\Big(\frac{1} {2}, \frac{1} {2}\Big)\).

    3. (c)

      \(H(p_{1},\ldots,p_{n},0) = H(p_{1},\ldots,p_{n})\).

  2. 3.2.

    Consider two events \(A =\{ a_{1},\ldots,a_{m}\}\) and \(B =\{ b_{1},\ldots,b_{n}\}\), and let p(a i , b j ) be the probability of the joint occurrence of outcomes a i and b j . The entropy of the joint event is defined by

    $$\displaystyle{H(A,B) = -\sum _{i,j}p(a_{i},b_{j})\log _{2}p(a_{i},b_{j}).}$$

    Prove the inequality

    $$\displaystyle{H(A,B) \leq H(A) + H(B),}$$

    with identity if and only if the events A and B are independent (i.e., \(p(a_{i},b_{j}) = p(a_{i})p(b_{j})\)).

  3. 3.3.

    If \(A =\{ a_{1},\ldots,a_{m}\}\) and \(B =\{ b_{1},\ldots,b_{n}\}\) are two events, define the conditional entropy of B given A by

    $$\displaystyle{H(B\vert A) = -\sum _{i,j}p(a_{i},b_{j})\log _{2}p_{a_{i}}(b_{j}),}$$

    and the information conveyed about B by A as

    $$\displaystyle{I(B\vert A) = H(B) - H(B\vert A),}$$

    where \(p_{a_{i}}(b_{j}) = \frac{p(a_{i},b_{j})} {\sum _{j}p(a_{i},b_{j})}\) is the conditional probability of b j given a i . Prove the following:

    1. (a)

      H(A, B) = H(A) + H(B | A); 

    2. (b)

      H(B) ≥ H(B | A). When does the equality hold?

    3. (c)

      H(B | A) − H(A | B) = H(B) − H(A); 

    4. (d)

      I(B | A) = I(A | B). 

  4. 3.4.

    Let X be a real-valued continuous random variable on \(\mathbb{R}^{n}\), with density function p(x). Define the entropy of X by

    $$\displaystyle{H(X) = -\int _{\mathbb{R}^{n}}p(x)\,\ln p(x)\,dx.}$$
    1. (a)

      Show that the entropy is translation invariant, i.e., H(X) = H(X + c), for any constant \(c \in \mathbb{R}\).

    2. (b)

      Prove the formula H(aX) = H(X) + ln | a | , for any constant \(a \in \mathbb{R}\). Show that by rescaling the random variable the entropy can change from negative to positive and vice versa.

    3. (c)

      Show that in the case of a vector valued random variable \(Y: \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\) and an n × n matrix A we have

      $$\displaystyle{H(AY ) = H(Y ) +\ln \vert \det A\vert.}$$
    4. (d)

      Use (c) to prove that the entropy is invariant under orthogonal transformations of the random variable.

  5. 3.5.

    The joint and conditional entropies of two continuous random variables X and Y are given by

    $$\displaystyle{H(X,Y ) = -\int \!\!\!\int p(x,y)\,\log _{2}p(x,y)\,dxdy,}$$
    $$\displaystyle{H(Y \vert X) = -\int \!\!\!\int p(x,y)\,\log _{2}\frac{p(x,y)} {p(x)} \,dxdy,}$$

    where p(x) = ∫ p(x, y) dy is the marginal probability of X. Prove the following:

    1. (a)

      H(X, Y ) = H(X) + H(Y | X) = H(Y ) + H(X | Y ); 

    2. (b)

      H(Y | X) ≤ H(Y ).

  6. 3.6.

    Let α(x, y) be a function with α(x, y) ≥ 0, \(\int _{\mathbb{R}}\alpha (x,y)\,dx =\int _{\mathbb{R}}\alpha (x,y)\,dy = 1\). Consider the averaging operation

    $$\displaystyle{q(y) =\int _{\mathbb{R}}\alpha (x,y)p(x)\,dx.}$$

    Prove that the entropy of the averaged distribution q(y) is equal to or greater than the entropy of p(x), i.e., H(q) ≥ H(p).

  7. 3.7.

    Consider the two-dimensional statistical model defined by

    $$\displaystyle{p(x,\xi ^{1},\xi ^{2}) = 2\xi ^{1}x + 3\xi ^{2}x^{2} + 4(1 -\xi ^{1} -\xi ^{2})x^{3},\qquad x \in (0,1).}$$
    1. (a)

      Compute the Fisher metric g ij (ξ).

    2. (b)

      Compute the entropy H(p).

    3. (c)

      Find ξ for which H is critical. Does it correspond to a maximum or to a minimum?

  8. 3.8.

    Find a generic formula for the informational entropy of the exponential family \(p(\xi,x) = e^{C(x)+\xi ^{i}F_{ i}(x)-\phi (\xi )}\), \(x \in \mathcal{X}\).

  9. 3.9.

    (The change of the entropy under a change of coordinates.) Consider the vector random variables X and Y, related by Y = ϕ(X), with \(\phi: \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\) invertible transformation.

    1. (a)

      Show that

      $$\displaystyle{H(Y ) = H(X) - E[\ln J_{\phi ^{-1}}],}$$

      where \(J_{\phi ^{-1}}\) is the Jacobian of ϕ −1 and E[ ⋅ ] is the expectation with respect to the probability density of X.

    2. (b)

      Consider the linear transformation Y = AX, with \(A \in \mathbb{R}^{n\times n}\) nonsingular matrix. What is the relation expressed by part (a) in this case?

  10. 3.10.

    Consider the Gaussian distribution

    $$\displaystyle{p(x_{1},\ldots,x_{n}) = \frac{\sqrt{\det A}} {(2\pi )^{n/2}}e^{-\frac{1} {2} \langle Ax,x\rangle },}$$

    where A is a symmetric, positive definite n × n matrix. Show that the entropy of p is

    $$\displaystyle{H = \frac{1} {2}\ln \frac{(2\pi e)^{n}} {\det A}.}$$
  11. 3.11.

    Let \(X = (X_{1},\ldots,X_{n})\) be a random vector in \(\mathbb{R}^{n}\), with \(E[X_{j}] = 0\), and denote by \(A = (a_{ij}) = (E[X_{i}X_{j}])\) the associated covariance matrix. Prove that

    $$\displaystyle{H(X) \leq \frac{1} {2}\ln [(2\pi e)^{n}\det A].}$$

    When is the equality reached?

  12. 3.12.

    Consider the density of an exponentially distributed random variable with parameter λ > 0

    $$\displaystyle{p(x,\lambda ) =\lambda e^{-\lambda x},\qquad x \geq 0.}$$

    Find its entropy.

  13. 3.13.

    Consider the Cauchy’s distribution on \(\mathbb{R}\)

    $$\displaystyle{p(x,\xi ) = \frac{\xi } {\pi } \frac{1} {x^{2} +\xi ^{2}},\qquad \xi > 0.}$$

    Show that its entropy is

    $$\displaystyle{H(\xi ) =\ln (4\pi \xi ).}$$
  14. 3.14.

    Find a generic formula for the informational energy of the mixture family p(ξ, x) = C(x) +ξ i F i (x), \(x \in \mathcal{X}\).

  15. 3.15.

    Let \(f(x) = \frac{x} {\sigma ^{2}} e^{-\frac{x^{2}} {2\sigma ^{2}} }\), x ≥ 0, σ > 0, be the Rayleigh distribution. Prove that its entropy is given by

    $$\displaystyle{H(\sigma ) = 1 +\ln \frac{\sigma } {\sqrt{2}} + \frac{\gamma } {2},}$$

    where γ is Euler’s constant.

  16. 3.16.

    Show that the entropy of the Maxwell–Boltzmann distribution

    $$\displaystyle{p(x,a) = \frac{1} {a^{3}}\sqrt{\frac{2} {\pi }} x^{2}e^{-\frac{x^{2}} {2a^{2}} },\qquad a > 0,\;x \geq 0}$$

    is \(H(a) =\ln (a\sqrt{2\pi }) +\gamma -\frac{1} {2}\), where γ is Euler’s constant.

  17. 3.17.

    Consider the Laplace distribution

    $$\displaystyle{f(x,b,\mu ) = \frac{1} {2b}e^{-\vert x-\mu \vert /b},\qquad b > 0,\mu \in \mathbb{R}.}$$

    Show that its entropy is

    $$\displaystyle{H(b,\mu ) = 1 +\ln (2b).}$$
  18. 3.18.

    Let \(\mu \in \mathbb{R}\). Construct a statistical model

    $$\displaystyle{\mathcal{S} =\{ p_{\xi }(x);\,\xi \in \mathbb{E},x \in \mathcal{X}\}}$$

    such that the functional \(F: \mathcal{S}\longrightarrow \mathbb{R}\),

    $$\displaystyle{F(p(\cdot )) =\int _{\mathcal{X}}xp(x)\,dx-\mu }$$

    has at least one critical point. Is \(\mathcal{M}_{\mu } = F^{-1}(0)\) a submanifold of \(\mathcal{S}\)?

  19. 3.19.

    Starting from the Euclidean space \((\mathbb{R}_{+}^{n},\delta _{ij})\), find the Hessian metric produced by the Shannon entropy function

    $$\displaystyle{f: \mathbb{R}_{+}^{n} \rightarrow \mathbb{R},\,\,f(x^{1},\cdots \,,x^{n}) = \frac{1} {k^{2}}\sum _{i=1}^{n}\,\ln (k^{2}x^{i}).}$$
  20. 3.20.

    Show that the inequality (3.9.20) becomes identity for the uniform distribution \(p: [0,e] \rightarrow [0,\infty )\), p(x) = 1∕e, and this is the only distribution with this property.

  21. 3.21.
    1. (a)

      Let \(a_{n}(\xi ) = \frac{\xi ^{n}\ln (n!)} {n!}\). Show that \(\lim _{n\rightarrow \infty }\Big\vert \frac{a_{n+1}(\xi )} {a_{n}(\xi )} \Big\vert = 0\) for any ξ;

    2. (b)

      Show that the series \(\sum _{n\geq 0}\frac{\xi ^{n}\ln (n!)} {n!}\) has an infinite radius of convergence;

    3. (c)

      Deduce that the entropy for the Poisson distribution is finite.

  22. 3.22.

    Show that the entropy of the beta distribution

    $$\displaystyle{p_{a,b}(x) = \frac{1} {B(a,b)}\,\,x^{a-1}(1 - x)^{b-1},\quad 0 \leq x \leq 1}$$

    is always non-positive, H(a,b) ≤ 0, for any a, b > 0. For which values of a and b does the entropy vanish?