Kullback-Leibler divergence (Kullback and Leibler 1951) is an information-based measure of disparity among probability distributions. Given distributions P and Q defined over X, with P absolutely continuous with respect to Q, the Kullback-Leibler divergence of Q from P is the P-expectation of \({\log }_{2}(P/Q)\). So, \({D}_{KL}(P,Q) = -{\int \nolimits }_{X}{\log }_{2}(Q(x)/P(x))\,dP.\) This quantity can be seen as the difference between the cross-entropy of Q on P, \(H(P,Q) = -{\int \nolimits }_{X}{\log }_{2}(Q(x))\,dP\), and the self-entropy (Shannon 1948) of P, \(H(P) = H(P,P) = -{\int \nolimits }_{X}{\log }_{2}(P(x))\,dP\). Since H(P, Q) is the P-expectation of the number of bits needed to identify points in X using a code based on Q, \({D}_{KL}(P,Q) = H(P,Q) - H(P)\) is the expected number of additional bits required, from the perspective of P, when Q rather than P is used to encode the information in X.
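
As a concrete illustration, the following minimal Python sketch (the distributions and function names are hypothetical, and all probabilities are assumed strictly positive) computes \({D}_{KL}(P,Q)\) for two discrete distributions and checks the identity \({D}_{KL}(P,Q) = H(P,Q) - H(P)\):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log2 Q(x)."""
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """D_KL(P, Q) = sum_x P(x) * log2(P(x) / Q(x))."""
    return np.sum(p * np.log2(p / q))

# Two hypothetical distributions over a three-point space (all probabilities > 0).
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

# D_KL(P, Q) equals cross-entropy minus self-entropy: H(P, Q) - H(P).
print(kl_divergence(P, Q))                        # about 0.036 bits
print(cross_entropy(P, Q) - cross_entropy(P, P))  # same value
```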

\({D}_{KL}\) has a number of features that make it plausible as a measure of probabilistic divergence. Here are some of its key properties:

  • Premetric. \({D}_{KL}(P,Q) \geq 0\), with equality if and only if P = Q a.e. with respect to P.

  • Convexity. \({D}_{KL}(P,Q)\) is jointly convex in the pair (P, Q), and hence convex in each argument separately.

  • Chain Rule. Given joint distributions P(x, y) and Q(x, y), define the KL-divergence conditional on x as \({D}_{KL}(P(y\vert x),Q(y\vert x)) = {\int \nolimits }_{X}{D}_{KL}(P(\cdot \vert x),Q(\cdot \vert x))\,d{P}_{x}\), where \({P}_{x}\) is P’s x-marginal. Then \({D}_{KL}(P(x,y),Q(x,y)) = {D}_{KL}({P}_{x},{Q}_{x}) + {D}_{KL}(P(y\vert x),Q(y\vert x))\). (A numerical check of this identity appears after this list.)

  • Independence. When X and Y are independent under both P and Q, the Chain Rule assumes the simple form \({D}_{KL}(P(x,y),Q(x,y)) = {D}_{KL}({P}_{x},{Q}_{x}) + {D}_{KL}({P}_{y},{Q}_{y})\), which reflects the well-known idea that independent information is additive.
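
The following small Python check, using hypothetical 2 × 2 joint distributions, illustrates the Chain Rule numerically:

```python
import numpy as np

def kl(p, q):
    """D_KL(P, Q) in bits, for discrete distributions with matching support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))

# Hypothetical 2x2 joint distributions P(x, y) and Q(x, y).
P = np.array([[0.30, 0.20],
              [0.10, 0.40]])
Q = np.array([[0.25, 0.25],
              [0.20, 0.30]])

Px, Qx = P.sum(axis=1), Q.sum(axis=1)   # x-marginals

# Conditional divergence: the P_x-expectation of D_KL(P(y|x), Q(y|x)).
cond = sum(Px[i] * kl(P[i] / Px[i], Q[i] / Qx[i]) for i in range(2))

# Chain Rule: divergence of the joints = divergence of the x-marginals
# plus the conditional divergence.
print(kl(P.ravel(), Q.ravel()))   # joint divergence
print(kl(Px, Qx) + cond)          # same value
```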

It should be emphasized that KL-divergence is not a genuine metric: it is not symmetric and it fails the triangle inequality. Thus, talk of Kullback-Leibler “distance” is misleading. One can create a symmetric measure by setting \({D}_{sym}(P,Q) = \tfrac{1}{2}{D}_{KL}(P,Q) + \tfrac{1}{2}{D}_{KL}(Q,P)\), but this symmetrized divergence still fails the triangle inequality.
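
A quick numerical illustration, using three hypothetical Bernoulli distributions, shows both the asymmetry and a failure of the triangle inequality:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))

# Three hypothetical Bernoulli distributions, written as [P(0), P(1)].
P, Q, R = [0.5, 0.5], [0.3, 0.7], [0.1, 0.9]

# Asymmetry: D_KL(P, Q) differs from D_KL(Q, P).
print(kl(P, Q), kl(Q, P))               # about 0.126 vs 0.119 bits

# Failure of the triangle inequality: D_KL(P, R) > D_KL(P, Q) + D_KL(Q, R).
print(kl(P, R), kl(P, Q) + kl(Q, R))    # about 0.737 vs 0.347 bits
```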

There is a close relationship between KL-divergence and a number of other statistical concepts. Consider, for example, mutual information. Given a joint distribution P(x, y) on X × Y with marginals \({P}_{X}\) and \({P}_{Y}\), the mutual information of X and Y with respect to P is defined as \({I}_{P}(X,Y) = {\int \nolimits }_{X\times Y}{\log }_{2}(P(x,y)/[{P}_{X}(x)\cdot {P}_{Y}(y)])\,dP.\) If we let \({P}^{\otimes }(x,y) = {P}_{X}(x)\cdot {P}_{Y}(y)\) be the factorization of P, then \({I}_{P}(X,Y) = {D}_{KL}(P,{P}^{\otimes })\). Thus, in terms of KL-divergence, mutual information measures the dissimilarity of a joint distribution from its factorization.
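
A minimal sketch, again with a hypothetical 2 × 2 joint distribution, computes mutual information as the KL-divergence of the joint distribution from the product of its marginals:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log2(p / q))

# Hypothetical joint distribution P(x, y) on a 2x2 space.
P = np.array([[0.30, 0.20],
              [0.10, 0.40]])

Px = P.sum(axis=1)              # marginal of X
Py = P.sum(axis=0)              # marginal of Y
P_factored = np.outer(Px, Py)   # the factorization of P

# Mutual information I_P(X, Y) as the KL-divergence of the joint
# distribution from its factorization.
print(kl(P.ravel(), P_factored.ravel()))   # > 0, since X and Y are dependent under P
```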

There is also a connection between KL-divergence and maximum likelihood estimation. Let \({l}_{x}(\theta ) = p(x\vert \theta )\) be a likelihood function with parameter \(\theta \in \Theta \), and imagine that enough data has been collected to make a certain empirical distribution f(x) seem reasonable. In MLE one seeks an estimate of θ that maximizes the expected log-likelihood under the empirical distribution, i.e., \(\hat{\theta } = \arg {\max }_{\theta }{E}_{f}[{\log }_{2}(p(x\vert \theta ))].\) To find this estimate it suffices to minimize the KL-divergence between f(x) and \(p(x\vert \theta )\) over θ, since

$$\begin{array}{rcl} \arg \min_{\theta }{D}_{KL}(\,f,p(\cdot \vert \theta )) & = & \arg \min_{\theta } -{\int \nolimits }_{X}f(x)\cdot {\log }_{2}(p(x\vert \theta )/f(x))\,dx \\ & = & \arg \min_{\theta }[H(\,f,p(\cdot \vert \theta )) - H(\,f,f)] \\ & = & \arg \min_{\theta }H(\,f,p(\cdot \vert \theta )) \\ & = & \arg \max_{\theta }{E}_{f}[{\log }_{2}(p(x\vert \theta ))]\end{array}$$

In short, maximum likelihood estimation minimizes the Kullback-Leibler divergence of the fitted model from the empirical distribution.
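
The following sketch illustrates the point for a Bernoulli model with a hypothetical empirical distribution: the grid value of θ minimizing \({D}_{KL}(\,f,p(\cdot \vert \theta ))\) coincides with the MLE, the empirical frequency.

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log2(p / q))

# Hypothetical empirical distribution f(x) over {0, 1}, e.g., 70 ones in 100 draws.
f = np.array([0.3, 0.7])

# Bernoulli likelihood model p(x | theta), evaluated over a grid of theta values.
thetas = np.linspace(0.01, 0.99, 981)
divergences = [kl(f, np.array([1 - t, t])) for t in thetas]

# The theta minimizing D_KL(f, p(.|theta)) coincides with the MLE,
# which for the Bernoulli model is the empirical frequency 0.7.
print(thetas[np.argmin(divergences)])   # approximately 0.7
```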

Kullback-Leibler divergence also plays a role in model selection. Indeed, Akaike (1973) uses \({D}_{KL}\) as the basis for his “information criterion” (AIC). Here, we imagine an unknown true distribution P(x) over a sample space X, and a set Π of candidate models, each of which specifies a parameterized family of distributions \(\pi (x\vert \theta )\), \(\theta \in \Theta \), over X. The models in Π are meant to approximate P, and the aim is to find the best approximation in light of data drawn from P. For each π and θ, \({D}_{KL}(P,\pi (x\vert \theta ))\) measures the information lost when \(\pi (x\vert \theta )\) is used to approximate P. If θ were known, one could minimize information loss by choosing π to minimize \({D}_{KL}(P,\pi (x\vert \theta ))\). But since θ is unknown, it must be estimated. For each body of data y and each π, let \({\hat{\theta }}_{y}\) be the maximum likelihood estimate of θ given y, and consider \({D}_{KL}(P,\pi (x\vert {\hat{\theta }}_{y}))\) as a random variable in y. Akaike maintained that one should choose the model that minimizes the expected value of this quantity, so that one chooses π to minimize \({E}_{y}[{D}_{KL}(P,\pi (x\vert {\hat{\theta }}_{y}))] = {E}_{y}[H(P,\pi (\cdot \vert {\hat{\theta }}_{y})) - H(P,P)].\) This is equivalent to maximizing \({E}_{y}{E}_{x}[{\log }_{2}(\pi (x\vert {\hat{\theta }}_{y}))].\) Akaike showed that, for large samples, the maximized log-likelihood \(\ln ({l}_{x}(\hat{\theta }))\) overestimates this expected log-likelihood (expressed in natural-log units) by approximately k, the number of estimated parameters, where \(\hat{\theta }\) is the MLE of θ. Correcting for the bias and multiplying by -2 yields the criterion \(\mathrm{AIC} = 2k - 2\ln ({l}_{x}(\hat{\theta }))\), which is to be minimized. In this way, some have claimed, the policy of minimizing KL-divergence leads one to value simplicity in models, since the “2k” term functions as a penalty for complexity (see Sober 2002).
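
As an illustration (not Akaike's own derivation), the sketch below computes \(\mathrm{AIC} = 2k - 2\ln L\) for two hypothetical Gaussian models of the same simulated data, one with the mean fixed at zero and one with the mean estimated:

```python
import numpy as np

# Hypothetical data, simulated once for illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=50)

def aic_normal(x, fix_mean_at_zero=False):
    """AIC = 2k - 2 ln L for a Gaussian model fit by maximum likelihood."""
    mu = 0.0 if fix_mean_at_zero else x.mean()
    sigma2 = np.mean((x - mu) ** 2)        # MLE of the variance
    n = len(x)
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = 1 if fix_mean_at_zero else 2       # number of estimated parameters
    return 2 * k - 2 * log_lik

# The "2k" term penalizes the extra parameter; the model with the smaller AIC
# is estimated to lose less information about the true distribution.
print(aic_normal(x, fix_mean_at_zero=True))
print(aic_normal(x, fix_mean_at_zero=False))
```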

KL-divergence also figures prominently in Bayesian approaches to experimental design, where it is treated as a utility function. The objective in such work is to design experiments that maximize the expected KL-divergence of the posterior from the prior, so that the experimental results are expected to be maximally informative. Lindley (1956) and De Groot (1962) are essential references here.
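
A minimal sketch of this idea, assuming a discretized uniform prior over a Bernoulli parameter and hypothetical candidate experiments consisting of n coin flips, computes the expected KL-divergence of the posterior from the prior under the prior predictive distribution:

```python
import numpy as np
from math import comb

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Discretized uniform prior over a Bernoulli parameter theta.
thetas = np.linspace(0.05, 0.95, 19)
prior = np.full(len(thetas), 1 / len(thetas))

def expected_information_gain(n):
    """Expected KL-divergence of the posterior from the prior when the
    experiment consists of n coin flips; expectation over the prior predictive."""
    gain = 0.0
    for k in range(n + 1):
        lik = np.array([comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas])
        pred = np.sum(prior * lik)          # prior predictive probability of k successes
        posterior = prior * lik / pred
        gain += pred * kl(posterior, prior)
    return gain

# Larger experiments have greater expected information gain.
print(expected_information_gain(1), expected_information_gain(5))
```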

Bayesians have also appealed to KL-divergence to provide a rationale for Bayesian conditioning and related belief-update rules, e.g., the probability kinematics of Jeffrey (1965). For example, Diaconis and Zabell (1982) show that the posterior probabilities prescribed by Bayesian conditioning or by probability kinematics minimize KL-divergence from the prior, subject to the constraints imposed by the new evidence. Thus, in the sense of information divergence captured by \({D}_{KL}\), these forms of updating introduce the least amount of new information consistent with the data received.
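
The following numerical sketch, with a hypothetical four-state prior and evidence restricting attention to the first two states, checks that the conditioned distribution diverges least from the prior among distributions assigning the evidence probability 1:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Hypothetical prior over four states; the evidence E is "state 0 or state 1".
prior = np.array([0.1, 0.3, 0.4, 0.2])
conditioned = np.array([0.25, 0.75, 0.0, 0.0])   # the prior conditioned on E

# Any other distribution assigning probability 1 to E diverges at least as much
# from the prior as the conditioned distribution does.
rng = np.random.default_rng(1)
for _ in range(5):
    w = rng.random(2)
    q = np.array([w[0] / w.sum(), w[1] / w.sum(), 0.0, 0.0])
    assert kl(conditioned, prior) <= kl(q, prior) + 1e-12

print(kl(conditioned, prior))   # the minimum divergence achievable given E
```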

About the Author

For biography see the entry on the St. Petersburg paradox.

Cross References

Akaike’s Information Criterion: Background, Derivation, Properties, and Refinements

Chernoff Bound

Entropy

Entropy and Cross Entropy as Diversity and Distance Measures

Estimation

Information Theory and Statistics

Measurement of Uncertainty

Radon-Nikodým Theorem

Statistical Evidence