Kullback-Leibler divergence (Kullback and Leibler 1951) is an information-based measure of disparity among probability distributions. Given distributions P and Q defined over X, with P absolutely continuous with respect to Q, the Kullback-Leibler divergence of Q from P is the P-expectation of \(\log_2(P/Q)\). So, \(D_{KL}(P,Q) = -\int_X \log_2(Q(x)/P(x))\,dP\). This quantity can be seen as the difference between the cross-entropy for Q on P, \(H(P,Q) = -\int_X \log_2(Q(x))\,dP\), and the self-entropy (Shannon 1948) of P, \(H(P) = H(P,P) = -\int_X \log_2(P(x))\,dP\). Since H(P, Q) is the P-expectation of the number of bits of information needed to identify points in X using a code optimized for Q, \(D_{KL}(P,Q) = H(P,Q) - H(P)\) is the expected number of extra bits, from the perspective of P, that are required when Q rather than P is used to encode X.
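For finite distributions these definitions can be checked directly. The following minimal Python sketch (the function names and the example distributions are illustrative, not from the entry) computes \(D_{KL}\) and cross-entropy in bits:

```python
import math

def kl_divergence(p, q):
    """D_KL(P, Q) = -sum_x p(x) * log2(q(x)/p(x)), in bits.

    Requires P absolutely continuous w.r.t. Q: q(x) = 0 implies p(x) = 0.
    """
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) * log2(q(x))."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

P = [0.5, 0.25, 0.25]
Q = [0.25, 0.25, 0.5]

# D_KL(P, Q) equals the cross-entropy minus the self-entropy H(P) = H(P, P)
d = kl_divergence(P, Q)
print(d)                                                     # 0.25
print(cross_entropy(P, Q) - cross_entropy(P, P))             # 0.25
```

Note that \(D_{KL}(P,P) = 0\), in accordance with the Premetric property below.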
\(D_{KL}\) has a number of features that make it plausible as a measure of probabilistic divergence. Here are some of its key properties:
- Premetric. \(D_{KL}(P,Q) \geq 0\), with equality if and only if P = Q almost everywhere with respect to P.
- Convexity. \(D_{KL}(P,Q)\) is convex in both P and Q.
- Chain Rule. Given joint distributions P(x, y) and Q(x, y), define the KL-divergence conditional on x as \(D_{KL}(P(y \mid x), Q(y \mid x)) = \int_X D_{KL}(P(y \mid x), Q(y \mid x))\,dP_x\), where \(P_x\) is P's x-marginal. Then \(D_{KL}(P(x,y), Q(x,y)) = D_{KL}(P_x, Q_x) + D_{KL}(P(y \mid x), Q(y \mid x))\).
- Independence. When X and Y are independent in both P and Q, the Chain Rule assumes the simple form \(D_{KL}(P(x,y), Q(x,y)) = D_{KL}(P_x, Q_x) + D_{KL}(P_y, Q_y)\), which reflects the well-known idea that independent information is additive.
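The Independence property can be verified numerically for finite distributions. A small sketch (the distributions are invented for illustration):

```python
import math

def kl(p, q):
    # D_KL over a common finite support, in bits
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

Px = {0: 0.5, 1: 0.5};  Qx = {0: 0.25, 1: 0.75}
Py = {0: 0.1, 1: 0.9};  Qy = {0: 0.4, 1: 0.6}

# Product (independent) joint distributions
Pxy = {(x, y): Px[x] * Py[y] for x in Px for y in Py}
Qxy = {(x, y): Qx[x] * Qy[y] for x in Qx for y in Qy}

lhs = kl(Pxy, Qxy)
rhs = kl(Px, Qx) + kl(Py, Qy)
print(lhs, rhs)   # equal: independent information is additive
```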
It should be emphasized that KL-divergence is not a genuine metric: it is not symmetric and fails the triangle inequality. Thus, talk of Kullback-Leibler "distance" is misleading. While one can create a symmetric divergence measure by setting \(D^{*}_{KL}(P,Q) = \frac{1}{2}D_{KL}(P,Q) + \frac{1}{2}D_{KL}(Q,P)\), this still fails the triangle inequality.
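Both failures are easy to exhibit with two-point distributions; a sketch with invented values:

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def j(p, q):
    # the symmetrized divergence D*_KL
    return 0.5 * kl(p, q) + 0.5 * kl(q, p)

P, Q, R = [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]

print(kl(P, Q), kl(Q, P))          # asymmetry: the two values differ
print(j(P, R), j(P, Q) + j(Q, R))  # triangle inequality fails: left > right
```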
There is a close relationship between KL-divergence and a number of other statistical concepts. Consider, for example, mutual information. Given a joint distribution P(x, y) on \(X \times Y\) with marginals \(P_X\) and \(P_Y\), the mutual information of X and Y with respect to P is defined as \(I_P(X,Y) = \int_{X \times Y} \log_2(P(x,y)/[P_X(x) \cdot P_Y(y)])\,dP\). If we let \(P^{\perp}(x,y) = P_X(x) \cdot P_Y(y)\) be the factorization of P, then \(I_P(X,Y) = D_{KL}(P, P^{\perp})\). Thus, according to KL-divergence, mutual information measures the dissimilarity of a joint distribution from its factorization.
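This identity is easy to check for a small discrete joint distribution (the numbers are invented for illustration):

```python
import math

def kl(p, q):
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)

# A correlated joint distribution on {0,1} x {0,1}
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
Px = {v: sum(p for (x, y), p in P.items() if x == v) for v in (0, 1)}
Py = {v: sum(p for (x, y), p in P.items() if y == v) for v in (0, 1)}

# The factorization of P
P_perp = {(x, y): Px[x] * Py[y] for x in Px for y in Py}

# Mutual information as the divergence of P from its factorization
I = kl(P, P_perp)
print(I)   # about 0.278 bits
```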
There is also a connection between KL-divergence and maximum likelihood estimation. Let \(l_x(\theta) = p(x \mid \theta)\) be a likelihood function with parameter \(\theta \in \Theta\), and imagine that enough data has been collected to make a certain empirical distribution f(x) seem reasonable. In MLE one seeks an estimate for \(\theta\) that maximizes the expected log-likelihood of the data, i.e., \(\hat{\theta} = \arg\max_{\theta} E_f[\log_2 p(x \mid \theta)]\). To find this quantity it suffices to minimize the KL-divergence between f(x) and \(p(x \mid \hat{\theta})\), since \(D_{KL}(f, p(\cdot \mid \theta)) = -H(f) - E_f[\log_2 p(x \mid \theta)]\) and the entropy term \(H(f)\) does not depend on \(\theta\).
In short, MLE minimizes Kullback-Leibler divergence from the empirical distribution.
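This equivalence can be demonstrated for a simple Bernoulli model over a parameter grid (the data and grid are invented for illustration):

```python
import math

# Empirical distribution f from (invented) data: 7 heads in 10 tosses
f = {1: 0.7, 0: 0.3}

def kl(p, q):
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)

def expected_loglik(theta):
    # E_f[log2 p(x | theta)] for the Bernoulli model p(1 | theta) = theta
    return f[1] * math.log2(theta) + f[0] * math.log2(1 - theta)

grid = [i / 100 for i in range(1, 100)]
theta_mle = max(grid, key=expected_loglik)
theta_kl = min(grid, key=lambda t: kl(f, {1: t, 0: 1 - t}))
print(theta_mle, theta_kl)   # both 0.7: the two optimizations coincide
```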
Kullback-Leibler divergence also plays a role in model selection. Indeed, Akaike (1973) uses \(D_{KL}\) as the basis for his "information criterion" (AIC). Here, we imagine an unknown true distribution P(x) over a sample space X, and a set \(\Pi\) of models, each element of which specifies a parameterized family of distributions \(\pi(x \mid \theta)\) over X. The models in \(\Pi\) are meant to approximate P, and the aim is to find the best approximation in light of data drawn from P. For each \(\pi\) and \(\theta\), \(D_{KL}(P, \pi(x \mid \theta))\) measures the information lost when \(\pi(x \mid \theta)\) is used to approximate P. If \(\theta\) were known, one could minimize information loss by choosing \(\pi\) to minimize \(D_{KL}(P, \pi(x \mid \theta))\). But since \(\theta\) is unknown, one must estimate it. For each body of data y and each \(\pi\), let \(\hat{\theta}_y\) be the MLE estimate of \(\theta\) given y, and consider \(D_{KL}(P, \pi(x \mid \hat{\theta}_y))\) as a random variable of y. Akaike maintained that one should choose the model that minimizes the expected value of this quantity, i.e., one chooses \(\pi\) to minimize \(E_y[D_{KL}(P, \pi(x \mid \hat{\theta}_y))] = E_y[H(P, \pi(\cdot \mid \hat{\theta}_y)) - H(P, P)]\). This is equivalent to maximizing \(E_y E_x[\log(\pi(x \mid \hat{\theta}_y))]\). Akaike proved that, for large samples, \(2k - 2\log(l_x(\hat{\theta}))\) is an unbiased estimate of \(-2\,E_y E_x[\log(\pi(x \mid \hat{\theta}_y))]\), where \(\hat{\theta}\) is the MLE estimate of \(\theta\) and k is the number of estimated parameters. In this way, some have claimed, the policy of minimizing KL-divergence leads one to value simplicity in models, since the "2k" term functions as a kind of penalty for complexity (see Sober 2002).
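As a schematic illustration of the trade-off (the models and data here are invented for the example), one can compare a zero-parameter fair-coin model with a one-parameter Bernoulli model using the criterion in its usual natural-log form, \(2k - 2\log(l_x(\hat{\theta}))\):

```python
import math

def aic(log_likelihood, k):
    # Akaike's criterion: 2k - 2 * (maximized) log-likelihood; smaller is
    # better, and the 2k term penalizes the number of estimated parameters k
    return 2 * k - 2 * log_likelihood

# Invented data: 10 coin tosses, 7 heads
n, heads = 10, 7

# Model 1: fair coin, nothing estimated (k = 0)
ll_fair = n * math.log(0.5)

# Model 2: Bernoulli(theta) with theta estimated by MLE (k = 1)
theta = heads / n
ll_mle = heads * math.log(theta) + (n - heads) * math.log(1 - theta)

# The richer model fits better, but the penalty can reverse the ranking
print(aic(ll_fair, 0), aic(ll_mle, 1))
```

On this small sample the extra parameter does not pay for itself: the fair-coin model has the lower AIC despite its worse fit.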
KL-divergence also figures prominently in Bayesian approaches to experimental design, where it is treated as a utility function. The objective in such work is to design experiments that maximize the expected KL-divergence between the prior and the posterior. The results of such experiments are interpreted as having a high degree of informational content. Lindley (1956) and De Groot (1962) are essential references here.
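A minimal sketch of the idea, using an invented two-hypothesis design problem: the expected KL-divergence of the posterior from the prior, averaged over possible outcomes, measures how informative a single observation is expected to be.

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Prior over two hypotheses: the coin is fair, or biased with P(heads) = 0.9
prior = [0.5, 0.5]
p_heads = [0.5, 0.9]   # likelihood of heads under each hypothesis

def update(likelihoods):
    # Bayes' rule: posterior over hypotheses, plus the outcome's probability
    joint = [pr * lk for pr, lk in zip(prior, likelihoods)]
    z = sum(joint)
    return [j / z for j in joint], z

post_h, prob_h = update(p_heads)
post_t, prob_t = update([1 - l for l in p_heads])

# Expected KL-divergence of posterior from prior: the design's utility
expected_gain = prob_h * kl(post_h, prior) + prob_t * kl(post_t, prior)
print(expected_gain)   # about 0.147 bits from a single toss
```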
Bayesians have also appealed to KL-divergence to provide a rationale for Bayesian conditioning and related belief update rules, e.g., the probability kinematics of Jeffrey (1965). For example, Diaconis and Zabell (1982) show that the posterior probabilities prescribed by Bayesian conditioning or by probability kinematics minimize KL-divergence from the perspective of the prior. Thus, in the sense of information divergence captured by D KL , these forms of updating introduce the least amount of new information consistent with the data received.
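The minimization claim can be illustrated for conditioning on an event in a finite space (the prior and evidence are invented for the example): among distributions concentrated on the evidence E, the conditional distribution is the one closest to the prior in KL-divergence.

```python
import math
import random

def kl(q, p):
    # D_KL(Q, P): divergence of the updated distribution Q from the prior P
    return sum(q[k] * math.log2(q[k] / p[k]) for k in q if q[k] > 0)

prior = {"a": 0.2, "b": 0.3, "c": 0.5}
E = ["b", "c"]   # evidence: the true state lies in E

# Bayesian conditioning renormalizes the prior on E
z = sum(prior[k] for k in E)
cond = {k: prior[k] / z for k in E}

# Compare against random alternative distributions concentrated on E
random.seed(0)
gaps = []
for _ in range(1000):
    w = random.random()
    alt = {"b": w, "c": 1 - w}
    gaps.append(kl(alt, prior) - kl(cond, prior))

print(kl(cond, prior))   # equals -log2(P(E)), the minimal divergence
print(min(gaps) >= 0)    # no alternative beats conditioning
```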
About the Author
For biography, see the entry on the St. Petersburg paradox.
References and Further Reading
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F (eds) Proceedings of the international symposium on information theory. Akademiai Kiado, Budapest
De Groot M (1962) Uncertainty, information, and sequential experiments. Ann Math Stat 33:404–419
Diaconis P, Zabell S (1982) Updating subjective probability. J Am Stat Assoc 77:822–830
Jeffrey R (1965) The logic of decision. McGraw-Hill, New York
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79–86
Lindley DV (1956) On the measure of information provided by an experiment. Ann Math Stat 27:986–1005
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656
Sober E (2002) Instrumentalism, parsimony, and the Akaike framework. Philos Sci 69:S112–S123
© 2011 Springer-Verlag Berlin Heidelberg

Joyce JM (2011) Kullback-Leibler divergence. In: Lovric M (ed) International encyclopedia of statistical science. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04898-2_327