Kullback-Leibler divergence (Kullback and Leibler 1951) is an information-based measure of disparity among probability distributions. Given distributions P and Q defined over X, with P absolutely continuous with respect to Q, the Kullback-Leibler divergence of Q from P is the P-expectation of \({\log }_{2}(P/Q)\). So, \({D}_{KL}(P,Q) = -{\int \nolimits }_{X}{\log }_{2}(Q(x)/P(x))\,dP.\) This quantity can be seen as the difference between the cross-entropy of Q on P, \(H(P,Q) = -{\int \nolimits }_{X}{\log }_{2}(Q(x))\,dP\), and the self-entropy (Shannon 1948) of P, \(H(P) = H(P,P) = -{\int \nolimits }_{X}{\log }_{2}(P(x))\,dP\). Since H(P, Q) is the P-expectation of the number of bits needed to identify points in X using a code based on Q, \({D}_{KL}(P,Q) = H(P,Q) - H(P)\) is the expected number of additional bits required, from the perspective of P, when Q rather than P is used to encode the information in X.
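
As a concrete illustration, the following minimal Python sketch (the distributions and function names are hypothetical, and all probabilities are assumed strictly positive) computes \({D}_{KL}(P,Q)\) for two discrete distributions and checks the identity \({D}_{KL}(P,Q) = H(P,Q) - H(P)\):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log2 Q(x)."""
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """D_KL(P, Q) = sum_x P(x) * log2(P(x) / Q(x))."""
    return np.sum(p * np.log2(p / q))

# Two hypothetical distributions over a three-point space (all probabilities > 0).
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

# D_KL(P, Q) equals cross-entropy minus self-entropy: H(P, Q) - H(P).
print(kl_divergence(P, Q))                        # about 0.036 bits
print(cross_entropy(P, Q) - cross_entropy(P, P))  # same value
```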

\({D}_{KL}\) has a number of features that make it plausible as a measure of probabilistic divergence. Here are some of its key properties:

  • Premetric. \({D}_{KL}(P,Q) \geq 0\), with equality if and only if P = Q a.e. with respect to P.

  • Convexity. \({D}_{KL}(P,Q)\) is jointly convex in the pair (P, Q), and hence convex in each argument separately.

  • Chain Rule. Given joint distributions P(x, y) and Q(x, y), define the KL-divergence conditional on x as \({D}_{KL}(P(y\vert x),Q(y\vert x)) = {\int \nolimits }_{X}{D}_{KL}(P(\cdot \vert x),Q(\cdot \vert x))\,d{P}_{x}\), where \({P}_{x}\) is P’s x-marginal. Then \({D}_{KL}(P(x,y),Q(x,y)) = {D}_{KL}({P}_{x},{Q}_{x}) + {D}_{KL}(P(y\vert x),Q(y\vert x))\). (A numerical check of this identity appears after this list.)

  • Independence. When X and Y are independent under both P and Q, the Chain Rule assumes the simple form \({D}_{KL}(P(x,y),Q(x,y)) = {D}_{KL}({P}_{x},{Q}_{x}) + {D}_{KL}({P}_{y},{Q}_{y})\), which reflects the well-known idea that independent information is additive.
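
The following small Python check, using hypothetical 2 × 2 joint distributions, illustrates the Chain Rule numerically:

```python
import numpy as np

def kl(p, q):
    """D_KL(P, Q) in bits, for discrete distributions with matching support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))

# Hypothetical 2x2 joint distributions P(x, y) and Q(x, y).
P = np.array([[0.30, 0.20],
              [0.10, 0.40]])
Q = np.array([[0.25, 0.25],
              [0.20, 0.30]])

Px, Qx = P.sum(axis=1), Q.sum(axis=1)   # x-marginals

# Conditional divergence: the P_x-expectation of D_KL(P(y|x), Q(y|x)).
cond = sum(Px[i] * kl(P[i] / Px[i], Q[i] / Qx[i]) for i in range(2))

# Chain Rule: divergence of the joints = divergence of the x-marginals
# plus the conditional divergence.
print(kl(P.ravel(), Q.ravel()))   # joint divergence
print(kl(Px, Qx) + cond)          # same value
```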

It should be emphasized that KL-divergence is not a genuine metric: it is not symmetric and it fails the triangle inequality. Thus, talk of Kullback-Leibler “distance” is misleading. One can create a symmetric measure by setting \({D}_{sym}(P,Q) = \tfrac{1}{2}{D}_{KL}(P,Q) + \tfrac{1}{2}{D}_{KL}(Q,P)\), but this symmetrized divergence still fails the triangle inequality.
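
A quick numerical illustration, using three hypothetical Bernoulli distributions, shows both the asymmetry and a failure of the triangle inequality:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))

# Three hypothetical Bernoulli distributions, written as [P(0), P(1)].
P, Q, R = [0.5, 0.5], [0.3, 0.7], [0.1, 0.9]

# Asymmetry: D_KL(P, Q) differs from D_KL(Q, P).
print(kl(P, Q), kl(Q, P))               # about 0.126 vs 0.119 bits

# Failure of the triangle inequality: D_KL(P, R) > D_KL(P, Q) + D_KL(Q, R).
print(kl(P, R), kl(P, Q) + kl(Q, R))    # about 0.737 vs 0.347 bits
```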

There is a close relationship between KL-divergence and a number of other statistical concepts. Consider, for example, mutual information. Given a joint distribution P(x, y) on X × Y with marginals \({P}_{X}\) and \({P}_{Y}\), the mutual information of X and Y with respect to P is defined as \({I}_{P}(X,Y) = {\int \nolimits }_{X\times Y}{\log }_{2}(P(x,y)/[{P}_{X}(x)\cdot {P}_{Y}(y)])\,dP.\) If we let \({P}^{\otimes }(x,y) = {P}_{X}(x)\cdot {P}_{Y}(y)\) be the factorization of P, then \({I}_{P}(X,Y) = {D}_{KL}(P,{P}^{\otimes })\). Thus, in terms of KL-divergence, mutual information measures the dissimilarity of a joint distribution from its factorization.
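
A minimal sketch, again with a hypothetical 2 × 2 joint distribution, computes mutual information as the KL-divergence of the joint distribution from the product of its marginals:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log2(p / q))

# Hypothetical joint distribution P(x, y) on a 2x2 space.
P = np.array([[0.30, 0.20],
              [0.10, 0.40]])

Px = P.sum(axis=1)              # marginal of X
Py = P.sum(axis=0)              # marginal of Y
P_factored = np.outer(Px, Py)   # the factorization of P

# Mutual information I_P(X, Y) as the KL-divergence of the joint
# distribution from its factorization.
print(kl(P.ravel(), P_factored.ravel()))   # > 0, since X and Y are dependent under P
```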

There is also a connection between KL-divergence and maximum likelihood estimation. Let \({l}_{x}(\theta ) = p(x\vert \theta )\) be a likelihood function with parameter \(\theta \in \Theta \), and imagine that enough data has been collected to make a certain empirical distribution f(x) seem reasonable. In MLE one seeks an estimate of θ that maximizes the expected log-likelihood under the empirical distribution, i.e., \(\hat{\theta } = \arg {\max }_{\theta }{E}_{f}[{\log }_{2}(p(x\vert \theta ))].\) To find this estimate it suffices to minimize the KL-divergence between f(x) and \(p(x\vert \theta )\) over θ, since

$$\begin{array}{rcl} \arg \min_{\theta }{D}_{KL}(\,f,p(\cdot \vert \theta )) & = & \arg \min_{\theta } -{\int \nolimits }_{X}f(x)\cdot {\log }_{2}(p(x\vert \theta )/f(x))\,dx \\ & = & \arg \min_{\theta }[H(\,f,p(\cdot \vert \theta )) - H(\,f,f)] \\ & = & \arg \min_{\theta }H(\,f,p(\cdot \vert \theta )) \\ & = & \arg \max_{\theta }{E}_{f}[{\log }_{2}(p(x\vert \theta ))]\end{array}$$

In short, maximum likelihood estimation minimizes the Kullback-Leibler divergence of the fitted model from the empirical distribution.
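
The following sketch illustrates the point for a Bernoulli model with a hypothetical empirical distribution: the grid value of θ minimizing \({D}_{KL}(\,f,p(\cdot \vert \theta ))\) coincides with the MLE, the empirical frequency.

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log2(p / q))

# Hypothetical empirical distribution f(x) over {0, 1}, e.g., 70 ones in 100 draws.
f = np.array([0.3, 0.7])

# Bernoulli likelihood model p(x | theta), evaluated over a grid of theta values.
thetas = np.linspace(0.01, 0.99, 981)
divergences = [kl(f, np.array([1 - t, t])) for t in thetas]

# The theta minimizing D_KL(f, p(.|theta)) coincides with the MLE,
# which for the Bernoulli model is the empirical frequency 0.7.
print(thetas[np.argmin(divergences)])   # approximately 0.7
```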

Kullback-Leibler divergence also plays a role in model selection. Indeed, Akaike (1973) uses \({D}_{KL}\) as the basis for his “information criterion” (AIC). Here, we imagine an unknown true distribution P(x) over a sample space X, and a set Π of candidate models, each of which specifies a parameterized family of distributions \(\pi (x\vert \theta )\), \(\theta \in \Theta \), over X. The models in Π are meant to approximate P, and the aim is to find the best approximation in light of data drawn from P. For each π and θ, \({D}_{KL}(P,\pi (x\vert \theta ))\) measures the information lost when \(\pi (x\vert \theta )\) is used to approximate P. If θ were known, one could minimize information loss by choosing π to minimize \({D}_{KL}(P,\pi (x\vert \theta ))\). But since θ is unknown, it must be estimated. For each body of data y and each π, let \({\hat{\theta }}_{y}\) be the maximum likelihood estimate of θ given y, and consider \({D}_{KL}(P,\pi (x\vert {\hat{\theta }}_{y}))\) as a random variable in y. Akaike maintained that one should choose the model that minimizes the expected value of this quantity, so that one chooses π to minimize \({E}_{y}[{D}_{KL}(P,\pi (x\vert {\hat{\theta }}_{y}))] = {E}_{y}[H(P,\pi (\cdot \vert {\hat{\theta }}_{y})) - H(P,P)].\) This is equivalent to maximizing \({E}_{y}{E}_{x}[{\log }_{2}(\pi (x\vert {\hat{\theta }}_{y}))].\) Akaike showed that, for large samples, the maximized log-likelihood \(\ln ({l}_{x}(\hat{\theta }))\) overestimates this expected log-likelihood (expressed in natural-log units) by approximately k, the number of estimated parameters, where \(\hat{\theta }\) is the MLE of θ. Correcting for the bias and multiplying by -2 yields the criterion \(\mathrm{AIC} = 2k - 2\ln ({l}_{x}(\hat{\theta }))\), which is to be minimized. In this way, some have claimed, the policy of minimizing KL-divergence leads one to value simplicity in models, since the “2k” term functions as a penalty for complexity (see Sober 2002).
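
As an illustration (not Akaike's own derivation), the sketch below computes \(\mathrm{AIC} = 2k - 2\ln L\) for two hypothetical Gaussian models of the same simulated data, one with the mean fixed at zero and one with the mean estimated:

```python
import numpy as np

# Hypothetical data, simulated once for illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=50)

def aic_normal(x, fix_mean_at_zero=False):
    """AIC = 2k - 2 ln L for a Gaussian model fit by maximum likelihood."""
    mu = 0.0 if fix_mean_at_zero else x.mean()
    sigma2 = np.mean((x - mu) ** 2)        # MLE of the variance
    n = len(x)
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = 1 if fix_mean_at_zero else 2       # number of estimated parameters
    return 2 * k - 2 * log_lik

# The "2k" term penalizes the extra parameter; the model with the smaller AIC
# is estimated to lose less information about the true distribution.
print(aic_normal(x, fix_mean_at_zero=True))
print(aic_normal(x, fix_mean_at_zero=False))
```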

KL-divergence also figures prominently in Bayesian approaches to experimental design, where it is treated as a utility function. The objective in such work is to design experiments that maximize the expected KL-divergence of the posterior from the prior, so that the experimental results are expected to be maximally informative. Lindley (1956) and De Groot (1962) are essential references here.
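
A minimal sketch of this idea, assuming a discretized uniform prior over a Bernoulli parameter and hypothetical candidate experiments consisting of n coin flips, computes the expected KL-divergence of the posterior from the prior under the prior predictive distribution:

```python
import numpy as np
from math import comb

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Discretized uniform prior over a Bernoulli parameter theta.
thetas = np.linspace(0.05, 0.95, 19)
prior = np.full(len(thetas), 1 / len(thetas))

def expected_information_gain(n):
    """Expected KL-divergence of the posterior from the prior when the
    experiment consists of n coin flips; expectation over the prior predictive."""
    gain = 0.0
    for k in range(n + 1):
        lik = np.array([comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas])
        pred = np.sum(prior * lik)          # prior predictive probability of k successes
        posterior = prior * lik / pred
        gain += pred * kl(posterior, prior)
    return gain

# Larger experiments have greater expected information gain.
print(expected_information_gain(1), expected_information_gain(5))
```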

Bayesians have also appealed to KL-divergence to provide a rationale for Bayesian conditioning and related belief-update rules, e.g., the probability kinematics of Jeffrey (1965). For example, Diaconis and Zabell (1982) show that the posterior probabilities prescribed by Bayesian conditioning or by probability kinematics minimize KL-divergence from the prior, subject to the constraints imposed by the new evidence. Thus, in the sense of information divergence captured by \({D}_{KL}\), these forms of updating introduce the least amount of new information consistent with the data received.
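
The following numerical sketch, with a hypothetical four-state prior and evidence restricting attention to the first two states, checks that the conditioned distribution diverges least from the prior among distributions assigning the evidence probability 1:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Hypothetical prior over four states; the evidence E is "state 0 or state 1".
prior = np.array([0.1, 0.3, 0.4, 0.2])
conditioned = np.array([0.25, 0.75, 0.0, 0.0])   # the prior conditioned on E

# Any other distribution assigning probability 1 to E diverges at least as much
# from the prior as the conditioned distribution does.
rng = np.random.default_rng(1)
for _ in range(5):
    w = rng.random(2)
    q = np.array([w[0] / w.sum(), w[1] / w.sum(), 0.0, 0.0])
    assert kl(conditioned, prior) <= kl(q, prior) + 1e-12

print(kl(conditioned, prior))   # the minimum divergence achievable given E
```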

About the Author

For biography see the entry on the St. Petersburg paradox.

Cross References

Akaike’s Information Criterion: Background, Derivation, Properties, and Refinements

Chernoff Bound

Entropy

Entropy and Cross Entropy as Diversity and Distance Measures

Estimation

Information Theory and Statistics

Measurement of Uncertainty

Radon-Nikodým Theorem

Statistical Evidence