Kullback–Leibler divergence



















In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution.[1][2] Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference. In contrast to variation of information, it is a distribution-wise asymmetric measure and thus does not qualify as a statistical metric of spread. In the simple case, a Kullback–Leibler divergence of 0 indicates that the two distributions in question are identical. In simplified terms, it is a measure of surprise, with diverse applications such as applied statistics, fluid mechanics, neuroscience and machine learning.




Contents






  • 1 Etymology


  • 2 Definition


  • 3 Basic example


  • 4 Interpretations


  • 5 Characterization


  • 6 Motivation


  • 7 Properties


  • 8 Examples


    • 8.1 Multivariate normal distributions




  • 9 Relation to metrics


    • 9.1 Fisher information metric


    • 9.2 Fisher information metric theorem




  • 10 Relation to other quantities of information theory


    • 10.1 Self-information


    • 10.2 Mutual information


    • 10.3 Shannon entropy


    • 10.4 Conditional entropy


    • 10.5 Cross entropy




  • 11 Bayesian updating


    • 11.1 Bayesian experimental design




  • 12 Discrimination information


    • 12.1 Principle of minimum discrimination information




  • 13 Relationship to available work


  • 14 Quantum information theory


  • 15 Relationship between models and reality


  • 16 Symmetrised divergence


  • 17 Relationship to other probability-distance measures


  • 18 Data differencing


  • 19 See also


  • 20 References


  • 21 External links





Etymology


The Kullback–Leibler divergence was introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions; Kullback preferred the term discrimination information.[3] The divergence is discussed in Kullback's 1959 book, Information Theory and Statistics.[2]



Definition


For discrete probability distributions P{displaystyle P}P and Q{displaystyle Q}Q defined on the same probability space, the Kullback–Leibler divergence between P{displaystyle P}P and Q{displaystyle Q}Q is defined[4] to be








DKL(P∥Q)=−x∈XP(x)log⁡(Q(x)P(x)){displaystyle D_{text{KL}}(Pparallel Q)=-sum _{xin {mathcal {X}}}P(x)log left({frac {Q(x)}{P(x)}}right)}{displaystyle D_{text{KL}}(Pparallel Q)=-sum _{xin {mathcal {X}}}P(x)log left({frac {Q(x)}{P(x)}}right)}












 



 



 



 





(Eq.1)




which is equivalent to


DKL(P∥Q)=∑x∈XP(x)log⁡(P(x)Q(x)).{displaystyle D_{text{KL}}(Pparallel Q)=sum _{xin {mathcal {X}}}P(x)log left({frac {P(x)}{Q(x)}}right).}{displaystyle D_{text{KL}}(Pparallel Q)=sum _{xin {mathcal {X}}}P(x)log left({frac {P(x)}{Q(x)}}right).}

In other words, it is the expectation of the logarithmic difference between the probabilities P{displaystyle P}P and Q{displaystyle Q}Q, where the expectation is taken using the probabilities P{displaystyle P}P. The Kullback–Leibler divergence is defined only if for all x{displaystyle x}x, Q(x)=0{displaystyle Q(x)=0}Q(x)=0 implies P(x)=0{displaystyle P(x)=0}P(x)=0 (absolute continuity). Whenever P(x){displaystyle P(x)}P(x) is zero the contribution of the corresponding term is interpreted as zero because




limx→0+xlog⁡(x)=0{displaystyle lim _{xto 0^{+}}xlog(x)=0}

{displaystyle lim _{xto 0^{+}}xlog(x)=0}
.

For distributions P{displaystyle P}P and Q{displaystyle Q}Q of a continuous random variable, the Kullback–Leibler divergence is defined to be the integral:[5]:p. 55








DKL(P∥Q)=∫p(x)log⁡(p(x)q(x))dx{displaystyle D_{text{KL}}(Pparallel Q)=int _{-infty }^{infty }p(x)log left({frac {p(x)}{q(x)}}right),dx}{displaystyle D_{text{KL}}(Pparallel Q)=int _{-infty }^{infty }p(x)log left({frac {p(x)}{q(x)}}right),dx}












 



 



 



 





(Eq.2)




where p{displaystyle p}p and q{displaystyle q}q denote the probability densities of P{displaystyle P}P and Q{displaystyle Q}Q.


More generally, if P{displaystyle P}P and Q{displaystyle Q}Q are probability measures over a set X{displaystyle {mathcal {X}}}{mathcal{X}}, and P{displaystyle P}P is absolutely continuous with respect to Q{displaystyle Q}Q, then the Kullback–Leibler divergence from Q{displaystyle Q}Q to P{displaystyle P}P is defined as


DKL(P∥Q)=∫Xlog⁡(dPdQ)dP,{displaystyle D_{text{KL}}(Pparallel Q)=int _{mathcal {X}}log left({frac {dP}{dQ}}right),dP,}{displaystyle D_{text{KL}}(Pparallel Q)=int _{mathcal {X}}log left({frac {dP}{dQ}}right),dP,}

where dPdQ{displaystyle {frac {dP}{dQ}}}{displaystyle {frac {dP}{dQ}}} is the Radon–Nikodym derivative of P{displaystyle P}P with respect to Q{displaystyle Q}Q, and provided the expression on the right-hand side exists. Equivalently (by the chain rule), this can be written as


DKL(P∥Q)=∫Xlog⁡(dPdQ)dPdQdQ,{displaystyle D_{text{KL}}(Pparallel Q)=int _{mathcal {X}}log left({frac {dP}{dQ}}right){frac {dP}{dQ}},dQ,}{displaystyle D_{text{KL}}(Pparallel Q)=int _{mathcal {X}}log left({frac {dP}{dQ}}right){frac {dP}{dQ}},dQ,}

which is the entropy of P{displaystyle P}P relative to Q{displaystyle Q}Q. Continuing in this case, if μ{displaystyle mu }mu is any measure on X{displaystyle {mathcal {X}}}{mathcal {X}} for which p=dPdμ{displaystyle p={frac {dP}{dmu }}}{displaystyle p={frac {dP}{dmu }}} and q=dQdμ{displaystyle q={frac {dQ}{dmu }}}{displaystyle q={frac {dQ}{dmu }}} exist (meaning that p{displaystyle p}p and q{displaystyle q}q are absolutely continuous with respect to μ{displaystyle mu }mu ), then the Kullback–Leibler divergence from Q{displaystyle Q}Q to P{displaystyle P}P is given as


DKL(P∥Q)=∫Xplog⁡(pq)dμ.{displaystyle D_{text{KL}}(Pparallel Q)=int _{mathcal {X}}plog left({frac {p}{q}}right),dmu .}{displaystyle D_{text{KL}}(Pparallel Q)=int _{mathcal {X}}plog left({frac {p}{q}}right),dmu .}

The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base e{displaystyle e}e if information is measured in nats. Most formulas involving the Kullback–Leibler divergence hold regardless of the base of the logarithm.


Various conventions exist for referring to DKL(P∥Q){displaystyle D_{text{KL}}(Pparallel Q)}{displaystyle D_{text{KL}}(Pparallel Q)} in words. Often it is referred to as the divergence between P{displaystyle P}P and Q{displaystyle Q}Q; however this fails to convey the fundamental asymmetry in the relation. Sometimes, as in this article, it may be found described as the divergence of P{displaystyle P}P from, or with respect to Q{displaystyle Q}Q. This reflects the asymmetry in Bayesian inference, which starts from a prior Q{displaystyle Q}Q and updates to the posterior P{displaystyle P}P.



Basic example


Kullback[2] gives the following example (Table 2.1, Example 2.1). Let P{displaystyle P}P and Q{displaystyle Q}Q be the distributions shown in the table and figure. P{displaystyle P}P is the distribution on the left side of the figure, a binomial distribution with N=2{displaystyle N=2}N=2 and p=0.4{displaystyle p=0.4}{displaystyle p=0.4}. Q{displaystyle Q}Q is the distribution on the right side of the figure, a discrete uniform distribution with the three possible outcomes x=0{displaystyle x=0}x=0, 1{displaystyle 1}1, or 2{displaystyle 2}2 (i.e. X={0,1,2}{displaystyle {mathcal {X}}={0,1,2}}{displaystyle {mathcal {X}}={0,1,2}}), each with probability p=1/3{displaystyle p=1/3}p=1/3.


Two distributions to illustrate Kullback–Leibler divergence





















x 0 1 2
Distribution P(x) 0.36 0.48 0.16
Distribution Q(x) 0.333 0.333 0.333

The KL divergence is calculated using the definition Eq.1 as follows. This example uses the natural log with base e, designated ln{displaystyle operatorname {ln} }{displaystyle operatorname {ln} }.



DKL(Q∥P)=∑x∈XQ(x)ln⁡(P(x)Q(x))=0.333ln⁡(0.3330.36)+0.333ln⁡(0.3330.48)+0.333ln⁡(0.3330.16)=−0.02596+(−0.12176)+0.24408=0.09637{displaystyle {begin{aligned}D_{text{KL}}(Qparallel P)&=sum _{xin {mathcal {X}}}Q(x)ln left({frac {P(x)}{Q(x)}}right)\[6pt]&=0.333ln left({frac {0.333}{0.36}}right)+0.333ln left({frac {0.333}{0.48}}right)+0.333ln left({frac {0.333}{0.16}}right)\[6pt]&=-0.02596+(-0.12176)+0.24408\[6pt]&=0.09637end{aligned}}}{displaystyle {begin{aligned}D_{text{KL}}(Qparallel P)&=sum _{xin {mathcal {X}}}Q(x)ln left({frac {P(x)}{Q(x)}}right)\[6pt]&=0.333ln left({frac {0.333}{0.36}}right)+0.333ln left({frac {0.333}{0.48}}right)+0.333ln left({frac {0.333}{0.16}}right)\[6pt]&=-0.02596+(-0.12176)+0.24408\[6pt]&=0.09637end{aligned}}}


nats (see units of information).



Interpretations


The Kullback–Leibler divergence from Q{displaystyle Q}Q to P{displaystyle P}P is often denoted DKL(P∥Q){displaystyle D_{text{KL}}(Pparallel Q)}{displaystyle D_{text{KL}}(Pparallel Q)}.


In the context of machine learning, DKL(P∥Q){displaystyle D_{text{KL}}(Pparallel Q)}{displaystyle D_{text{KL}}(Pparallel Q)} is often called the information gain achieved if Q{displaystyle Q}Q is used instead of P{displaystyle P}P. By analogy with information theory, it is also called the relative entropy of P{displaystyle P}P with respect to Q{displaystyle Q}Q. In the context of coding theory, DKL(P∥Q){displaystyle D_{text{KL}}(Pparallel Q)}{displaystyle D_{text{KL}}(Pparallel Q)} can be construed as measuring the expected number of extra bits required to code samples from P{displaystyle P}P using a code optimized for Q{displaystyle Q}Q rather than the code optimized for P{displaystyle P}P.


Expressed in the language of Bayesian inference, DKL(P∥Q){displaystyle D_{text{KL}}(Pparallel Q)}{displaystyle D_{text{KL}}(Pparallel Q)} is a measure of the information gained when one revises one's beliefs from the prior probability distribution Q{displaystyle Q}Q to the posterior probability distribution P{displaystyle P}P. In other words, it is the amount of information lost when Q{displaystyle Q}Q is used to approximate P{displaystyle P}P.[6] In applications, P{displaystyle P}P typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Q{displaystyle Q}Q typically represents a theory, model, description, or approximation of P{displaystyle P}P. In order to find a distribution Q{displaystyle Q}Q that is closest to P{displaystyle P}P, we can minimize KL divergence and compute an information projection.


The Kullback–Leibler divergence is a special case of a broader class of divergences called f-divergences as well as the class of Bregman divergences. It is the only such divergence over probabilities that is a member of both classes. Although it is often intuited as a way of measuring the distance between probability distributions, the Kullback–Leibler divergence is not a true metric. It does not obey the triangle inequality, and in general DKL(P∥Q){displaystyle D_{text{KL}}(Pparallel Q)}{displaystyle D_{text{KL}}(Pparallel Q)} does not equal DKL(Q∥P){displaystyle D_{text{KL}}(Qparallel P)}{displaystyle D_{text{KL}}(Qparallel P)}. However, its infinitesimal form, specifically its Hessian, gives a metric tensor known as the Fisher information metric.



Characterization


Arthur Hobson proved that the Kullback–Leibler divergence is the only measure of difference between probability distributions that satisfies some desired properties, which are the canonical extension to those appearing in a commonly used characterization of entropy.[7] Consequently, mutual information is the only measure of mutual dependence that obeys certain related conditions, since it can be defined in terms of Kullback–Leibler divergence.


There is also a Bayesian characterization of the Kullback–Leibler divergence.[8]



Motivation




Illustration of the Kullback–Leibler (KL) divergence for two normal distributions. The typical asymmetry for the Kullback–Leibler divergence is clearly visible.


In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value xi{displaystyle x_{i}}x_{i} out of a set of possibilities X{displaystyle X}X can be seen as representing an implicit probability distribution q(xi)=2−li{displaystyle q(x_{i})=2^{-l_{i}}}q(x_{i})=2^{-l_{i}} over X{displaystyle X}X, where li{displaystyle l_{i}}l_{i} is the length of the code for xi{displaystyle x_{i}}x_{i} in bits. Therefore, the Kullback–Leibler divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q{displaystyle Q}Q is used, compared to using a code based on the true distribution P{displaystyle P}P.


DKL(P∥Q)=−x∈Xp(x)log⁡q(x)+∑x∈Xp(x)log⁡p(x)=H(P,Q)−H(P){displaystyle {begin{aligned}D_{text{KL}}(Pparallel Q)&=-sum _{xin {mathcal {X}}}p(x)log q(x)+sum _{xin {mathcal {X}}}p(x)log p(x)\&=mathrm {H} (P,Q)-mathrm {H} (P)end{aligned}}}{displaystyle {begin{aligned}D_{text{KL}}(Pparallel Q)&=-sum _{xin {mathcal {X}}}p(x)log q(x)+sum _{xin {mathcal {X}}}p(x)log p(x)\&=mathrm {H} (P,Q)-mathrm {H} (P)end{aligned}}}

where H(P,Q){displaystyle mathrm {H} (P,Q)}{displaystyle mathrm {H} (P,Q)} is the cross entropy of P{displaystyle P}P and Q{displaystyle Q}Q, and H(P){displaystyle mathrm {H} (P)}{displaystyle mathrm {H} (P)} is the entropy of P{displaystyle P}P.


Note also that there is a relation between the Kullback–Leibler divergence and the "rate function" in the theory of large deviations.[9][10]



Properties


  • The Kullback–Leibler divergence is always non-negative,


DKL(P∥Q)≥0,{displaystyle D_{text{KL}}(Pparallel Q)geq 0,}{displaystyle D_{text{KL}}(Pparallel Q)geq 0,}

a result known as Gibbs' inequality, with DKL(P∥Q){displaystyle D_{text{KL}}(Pparallel Q)}{displaystyle D_{text{KL}}(Pparallel Q)} zero if and only if P=Q{displaystyle P=Q}P=Q almost everywhere. The entropy H(P){displaystyle mathrm {H} (P)}{displaystyle mathrm {H} (P)} thus sets a minimum value for the cross-entropy H(P,Q){displaystyle mathrm {H} (P,Q)}{displaystyle mathrm {H} (P,Q)}, the expected number of bits required when using a code based on Q{displaystyle Q}Q rather than P{displaystyle P}P; and the Kullback–Leibler divergence therefore represents the expected number of extra bits that must be transmitted to identify a value x{displaystyle x}x drawn from X{displaystyle X}X, if a code is used corresponding to the probability distribution Q{displaystyle Q}Q, rather than the "true" distribution P{displaystyle P}P.


  • The Kullback–Leibler divergence remains well-defined for continuous distributions, and furthermore is invariant under parameter transformations. For example, if a transformation is made from variable x{displaystyle x}x to variable y(x){displaystyle y(x)}y(x), then, since P(x)dx=P(y)dy{displaystyle P(x)dx=P(y)dy}{displaystyle P(x)dx=P(y)dy} and Q(x)dx=Q(y)dy{displaystyle Q(x)dx=Q(y)dy}{displaystyle Q(x)dx=Q(y)dy} the Kullback–Leibler divergence may be rewritten:


DKL(P∥Q)=∫xaxbP(x)log⁡(P(x)Q(x))dx=∫yaybP(y)log⁡(P(y)dydxQ(y)dydx)dy=∫yaybP(y)log⁡(P(y)Q(y))dy{displaystyle {begin{aligned}D_{text{KL}}(Pparallel Q)&=int _{x_{a}}^{x_{b}}P(x)log left({frac {P(x)}{Q(x)}}right),dx\[6pt]&=int _{y_{a}}^{y_{b}}P(y)log left({frac {P(y),{frac {dy}{dx}}}{Q(y),{frac {dy}{dx}}}}right),dy=int _{y_{a}}^{y_{b}}P(y)log left({frac {P(y)}{Q(y)}}right),dyend{aligned}}}{displaystyle {begin{aligned}D_{text{KL}}(Pparallel Q)&=int _{x_{a}}^{x_{b}}P(x)log left({frac {P(x)}{Q(x)}}right),dx\[6pt]&=int _{y_{a}}^{y_{b}}P(y)log left({frac {P(y),{frac {dy}{dx}}}{Q(y),{frac {dy}{dx}}}}right),dy=int _{y_{a}}^{y_{b}}P(y)log left({frac {P(y)}{Q(y)}}right),dyend{aligned}}}

where ya=y(xa){displaystyle y_{a}=y(x_{a})}{displaystyle y_{a}=y(x_{a})} and yb=y(xb){displaystyle y_{b}=y(x_{b})}{displaystyle y_{b}=y(x_{b})}. Although it was assumed that the transformation was continuous, this need not be the case. This also shows that the Kullback–Leibler divergence produces a dimensionally consistent quantity, since if x{displaystyle x}x is a dimensioned variable, P(x){displaystyle P(x)}P(x) and Q(x){displaystyle Q(x)}Q(x) are also dimensioned, since e.g. P(x)dx{displaystyle P(x)dx}{displaystyle P(x)dx} is dimensionless. The argument of the logarithmic term is and remains dimensionless, as it must. It can therefore be seen as in some ways a more fundamental quantity than some other properties in information theory[11] (such as self-information or Shannon entropy), which can become undefined or negative for non-discrete probabilities.


  • The Kullback–Leibler divergence is additive for independent distributions in much the same way as Shannon entropy. If P1,P2{displaystyle P_{1},P_{2}}P_{1},P_{2} are independent distributions, with the joint distribution P(x,y)=P1(x)P2(y){displaystyle P(x,y)=P_{1}(x)P_{2}(y)}{displaystyle P(x,y)=P_{1}(x)P_{2}(y)}, and Q,Q1,Q2{displaystyle Q,Q_{1},Q_{2}}Q,Q_{1},Q_{2} likewise, then

DKL(P∥Q)=DKL(P1∥Q1)+DKL(P2∥Q2).{displaystyle D_{text{KL}}(Pparallel Q)=D_{text{KL}}(P_{1}parallel Q_{1})+D_{text{KL}}(P_{2}parallel Q_{2}).}{displaystyle D_{text{KL}}(Pparallel Q)=D_{text{KL}}(P_{1}parallel Q_{1})+D_{text{KL}}(P_{2}parallel Q_{2}).}

  • The Kullback–Leibler divergence DKL(P∥Q){displaystyle D_{text{KL}}(Pparallel Q)}{displaystyle D_{text{KL}}(Pparallel Q)} is convex in the pair of probability mass functions (p,q){displaystyle (p,q)}(p,q), i.e. if (p1,q1){displaystyle (p_{1},q_{1})}{displaystyle (p_{1},q_{1})} and (p2,q2){displaystyle (p_{2},q_{2})}{displaystyle (p_{2},q_{2})} are two pairs of probability mass functions, then

DKL(λp1+(1−λ)p2∥λq1+(1−λ)q2)≤λDKL(p1∥q1)+(1−λ)DKL(p2∥q2) for 0≤λ1.{displaystyle D_{text{KL}}(lambda p_{1}+(1-lambda )p_{2}parallel lambda q_{1}+(1-lambda )q_{2})leq lambda D_{text{KL}}(p_{1}parallel q_{1})+(1-lambda )D_{text{KL}}(p_{2}parallel q_{2}){text{ for }}0leq lambda leq 1.}{displaystyle D_{text{KL}}(lambda p_{1}+(1-lambda )p_{2}parallel lambda q_{1}+(1-lambda )q_{2})leq lambda D_{text{KL}}(p_{1}parallel q_{1})+(1-lambda )D_{text{KL}}(p_{2}parallel q_{2}){text{ for }}0leq lambda leq 1.}


Examples



Multivariate normal distributions


Suppose that we have two multivariate normal distributions, with means μ0,μ1{displaystyle mu _{0},mu _{1}}mu _{0},mu _{1} and with (non-singular) covariance matrices Σ0,Σ1.{displaystyle Sigma _{0},Sigma _{1}.}{displaystyle Sigma _{0},Sigma _{1}.} If the two distributions have the same dimension, k{displaystyle k}k, then the Kullback–Leibler divergence between the distributions is as follows:[12]:p. 13


DKL(N0∥N1)=12(tr⁡1−0)+(μ1−μ0)TΣ1−1(μ1−μ0)−k+ln⁡(detΣ1detΣ0)).{displaystyle D_{text{KL}}({mathcal {N}}_{0}parallel {mathcal {N}}_{1})={frac {1}{2}}left(operatorname {tr} left(Sigma _{1}^{-1}Sigma _{0}right)+(mu _{1}-mu _{0})^{mathsf {T}}Sigma _{1}^{-1}(mu _{1}-mu _{0})-k+ln left({frac {det Sigma _{1}}{det Sigma _{0}}}right)right).}{displaystyle D_{text{KL}}({mathcal {N}}_{0}parallel {mathcal {N}}_{1})={frac {1}{2}}left(operatorname {tr} left(Sigma _{1}^{-1}Sigma _{0}right)+(mu _{1}-mu _{0})^{mathsf {T}}Sigma _{1}^{-1}(mu _{1}-mu _{0})-k+ln left({frac {det Sigma _{1}}{det Sigma _{0}}}right)right).}

The logarithm in the last term must be taken to base e since all terms apart from the last are base-e logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by ln(2){displaystyle ln(2)}ln(2) yields the divergence in bits.


A special case, and a common quantity in variational inference, is the KL-divergence between a diagonal multivariate normal, and a standard normal distribution:


DKL(N((μ1,…k)T,diag⁡12,…k2))∥N(0,I))=12∑i=1k(σi2+μi2−ln⁡i2)−1).{displaystyle D_{text{KL}}left({mathcal {N}}left((mu _{1},ldots ,mu _{k})^{mathsf {T}},operatorname {diag} (sigma _{1}^{2},ldots ,sigma _{k}^{2})right)parallel {mathcal {N}}left(mathbf {0} ,mathbf {I} right)right)={1 over 2}sum _{i=1}^{k}(sigma _{i}^{2}+mu _{i}^{2}-ln(sigma _{i}^{2})-1).}{displaystyle D_{text{KL}}left({mathcal {N}}left((mu _{1},ldots ,mu _{k})^{mathsf {T}},operatorname {diag} (sigma _{1}^{2},ldots ,sigma _{k}^{2})right)parallel {mathcal {N}}left(mathbf {0} ,mathbf {I} right)right)={1 over 2}sum _{i=1}^{k}(sigma _{i}^{2}+mu _{i}^{2}-ln(sigma _{i}^{2})-1).}


Relation to metrics


One might be tempted to call the Kullback–Leibler divergence a "distance metric" on the space of probability distributions, but this would not be correct as it is not symmetric – that is, DKL(P∥Q)≠DKL(Q∥P){displaystyle D_{text{KL}}(Pparallel Q)neq D_{text{KL}}(Qparallel P)}{displaystyle D_{text{KL}}(Pparallel Q)neq D_{text{KL}}(Qparallel P)} – nor does it satisfy the triangle inequality. Even so, being a premetric, it generates a topology on the space of probability distributions. More concretely, if {P1,P2,…}{displaystyle {P_{1},P_{2},ldots }}{displaystyle {P_{1},P_{2},ldots }} is a sequence of distributions such that








limn→DKL(Pn∥Q)=0{displaystyle lim _{nto infty }D_{text{KL}}(P_{n}parallel Q)=0}{displaystyle lim _{nto infty }D_{text{KL}}(P_{n}parallel Q)=0}












 



 



 



 





(1)




then it is said that








Pn→DQ.{displaystyle P_{n}{xrightarrow {D}}Q.}{displaystyle P_{n}{xrightarrow {D}}Q.}












 



 



 



 





(2)




Pinsker's inequality entails that








Pn→DP⇒Pn→TVP,{displaystyle P_{n}{xrightarrow {D}}PRightarrow P_{n}{xrightarrow {TV}}P,}{displaystyle P_{n}{xrightarrow {D}}PRightarrow P_{n}{xrightarrow {TV}}P,}












 



 



 



 





(3)




where the latter stands for the usual convergence in total variation.


Following Rényi (1970, 1961)[13][14]



Fisher information metric


The Kullback–Leibler divergence is directly related to the Fisher information metric. This can be made explicit as follows. Assume that the probability distributions P{displaystyle P}P and Q{displaystyle Q}Q are both parameterized by some (possibly multi-dimensional) parameter θ{displaystyle theta }theta . Consider then two close-by values of P=P(θ){displaystyle P=P(theta )}P=P(theta ) and Q=P(θ0){displaystyle Q=P(theta _{0})}Q=P(theta _{0}) so that the parameter θ{displaystyle theta }theta differs by only a small amount from the parameter value θ0{displaystyle theta _{0}}theta _{0}. Specifically, up to first order one has (using the Einstein summation convention)


P(θ)=P(θ0)+ΔθjPj(θ0)+⋯{displaystyle P(theta )=P(theta _{0})+Delta theta ^{j}P_{j}(theta _{0})+cdots }{displaystyle P(theta )=P(theta _{0})+Delta theta ^{j}P_{j}(theta _{0})+cdots }

with Δθj=(θθ0)j{displaystyle Delta theta ^{j}=(theta -theta _{0})^{j}}Delta theta ^{j}=(theta -theta _{0})^{j} a small change of θ{displaystyle theta }theta in the j{displaystyle j}j direction, and Pj(θ0)=∂P∂θj(θ0){displaystyle P_{j}left(theta _{0}right)={frac {partial P}{partial theta ^{j}}}(theta _{0})}{displaystyle P_{j}left(theta _{0}right)={frac {partial P}{partial theta ^{j}}}(theta _{0})} the corresponding rate of change in the probability distribution. Since the Kullback–Leibler divergence has an absolute minimum 0 for P=Q{displaystyle P=Q}P=Q, i.e. θ0{displaystyle theta =theta _{0}}theta =theta _{0}, it changes only to second order in the small parameters Δθj{displaystyle Delta theta ^{j}}Delta theta ^{j}. More formally, as for any minimum, the first derivatives of the divergence vanish


θj|θ0DKL(P(θ)∥P(θ0))=0,{displaystyle left.{frac {partial }{partial theta ^{j}}}right|_{theta =theta _{0}}D_{KL}(P(theta )parallel P(theta _{0}))=0,}{displaystyle left.{frac {partial }{partial theta ^{j}}}right|_{theta =theta _{0}}D_{KL}(P(theta )parallel P(theta _{0}))=0,}

and by the Taylor expansion one has up to second order


DKL(P(θ)∥P(θ0))=12Δθθkgjk(θ0)+⋯{displaystyle D_{text{KL}}(P(theta )parallel P(theta _{0}))={frac {1}{2}}Delta theta ^{j}Delta theta ^{k}g_{jk}(theta _{0})+cdots }{displaystyle D_{text{KL}}(P(theta )parallel P(theta _{0}))={frac {1}{2}}Delta theta ^{j}Delta theta ^{k}g_{jk}(theta _{0})+cdots }

where the Hessian matrix of the divergence


gjk(θ0)=∂2∂θj∂θk|θ0DKL(P(θ)∥P(θ0)){displaystyle g_{jk}(theta _{0})=left.{frac {partial ^{2}}{partial theta ^{j},partial theta ^{k}}}right|_{theta =theta _{0}}D_{text{KL}}(P(theta )parallel P(theta _{0}))}{displaystyle g_{jk}(theta _{0})=left.{frac {partial ^{2}}{partial theta ^{j},partial theta ^{k}}}right|_{theta =theta _{0}}D_{text{KL}}(P(theta )parallel P(theta _{0}))}

must be positive semidefinite. Letting θ0{displaystyle theta _{0}}theta _{0} vary (and dropping the subindex 0) the Hessian gjk(θ){displaystyle g_{jk}(theta )}g_{jk}(theta ) defines a (possibly degenerate) Riemannian metric on the θ parameter space, called the Fisher information metric.



Fisher information metric theorem


When p(x,ρ){displaystyle p_{(x,rho )}}{displaystyle p_{(x,rho )}} satisfies the following regulatory conditions: log⁡(p)∂ρ,∂2log⁡(p)∂ρ2,∂3log⁡(p)∂ρ3{displaystyle {tfrac {partial log(p)}{partial rho }},{tfrac {partial ^{2}log(p)}{partial rho ^{2}}},{tfrac {partial ^{3}log(p)}{partial rho ^{3}}}}{displaystyle {tfrac {partial log(p)}{partial rho }},{tfrac {partial ^{2}log(p)}{partial rho ^{2}}},{tfrac {partial ^{3}log(p)}{partial rho ^{3}}}} exist,


|∂p∂ρ|<F(x):∫x=0∞F(x)dx<∞,|∂2p∂ρ2|<G(x):∫x=0∞G(x)dx<∞|∂3log⁡(p)∂ρ3|<H(x):∫x=0∞p(x,0)H(x)dx<ξ<∞{displaystyle {begin{aligned}left|{frac {partial p}{partial rho }}right|&<F(x):int _{x=0}^{infty }F(x),dx<infty ,\left|{frac {partial ^{2}p}{partial rho ^{2}}}right|&<G(x):int _{x=0}^{infty }G(x),dx<infty \left|{frac {partial ^{3}log(p)}{partial rho ^{3}}}right|&<H(x):int _{x=0}^{infty }p(x,0)H(x),dx<xi <infty end{aligned}}}{displaystyle {begin{aligned}left|{frac {partial p}{partial rho }}right|&<F(x):int _{x=0}^{infty }F(x),dx<infty ,\left|{frac {partial ^{2}p}{partial rho ^{2}}}right|&<G(x):int _{x=0}^{infty }G(x),dx<infty \left|{frac {partial ^{3}log(p)}{partial rho ^{3}}}right|&<H(x):int _{x=0}^{infty }p(x,0)H(x),dx<xi <infty end{aligned}}}

where ξ is independent of ρ


x=0∞p(x,ρ)∂ρ=0dx=∫x=0∞2p(x,ρ)∂ρ2|ρ=0dx=0{displaystyle left.int _{x=0}^{infty }{frac {partial p(x,rho )}{partial rho }}right|_{rho =0},dx=left.int _{x=0}^{infty }{frac {partial ^{2}p(x,rho )}{partial rho ^{2}}}right|_{rho =0},dx=0}{displaystyle left.int _{x=0}^{infty }{frac {partial p(x,rho )}{partial rho }}right|_{rho =0},dx=left.int _{x=0}^{infty }{frac {partial ^{2}p(x,rho )}{partial rho ^{2}}}right|_{rho =0},dx=0}

then:


D(p(x,0)∥p(x,ρ))=cρ22+O(ρ3) as ρ0.{displaystyle {mathcal {D}}(p(x,0)parallel p(x,rho ))={frac {crho ^{2}}{2}}+{mathcal {O}}(rho ^{3}){text{ as }}rho to 0.}{displaystyle {mathcal {D}}(p(x,0)parallel p(x,rho ))={frac {crho ^{2}}{2}}+{mathcal {O}}(rho ^{3}){text{ as }}rho to 0.}


Relation to other quantities of information theory


Many of the other quantities of information theory can be interpreted as applications of the Kullback–Leibler divergence to specific cases.



Self-information



The self-information, also known as the information content of a signal, random variable, or event is defined as the negative logarithm of the probability of the given outcome occurring.


When applied to a discrete random variable, the self-information can be represented as[citation needed]


I⁡(m)=DKL(δim∥{pi}),{displaystyle operatorname {operatorname {I} } (m)=D_{text{KL}}(delta _{im}parallel {p_{i}}),}{displaystyle operatorname {operatorname {I} } (m)=D_{text{KL}}(delta _{im}parallel {p_{i}}),}

is the Kullback–Leibler divergence of the probability distribution P(i){displaystyle P(i)}{displaystyle P(i)} from a Kronecker delta representing certainty that i=m{displaystyle i=m}{displaystyle i=m} — i.e. the number of extra bits that must be transmitted to identify i{displaystyle i}i if only the probability distribution P(i){displaystyle P(i)}{displaystyle P(i)} is available to the receiver, not the fact that i=m{displaystyle i=m}{displaystyle i=m}.



Mutual information


The mutual information,[citation needed]


I⁡(X;Y)=DKL(P(X,Y)∥P(X)P(Y))=EX⁡{DKL(P(Y∣X)∥P(Y))}=EY⁡{DKL(P(X∣Y)∥P(X))}{displaystyle {begin{aligned}operatorname {I} (X;Y)&=D_{text{KL}}(P(X,Y)parallel P(X)P(Y))\&=operatorname {E} _{X}{D_{text{KL}}(P(Ymid X)parallel P(Y))}\&=operatorname {E} _{Y}{D_{text{KL}}(P(Xmid Y)parallel P(X))}end{aligned}}}{displaystyle {begin{aligned}operatorname {I} (X;Y)&=D_{text{KL}}(P(X,Y)parallel P(X)P(Y))\&=operatorname {E} _{X}{D_{text{KL}}(P(Ymid X)parallel P(Y))}\&=operatorname {E} _{Y}{D_{text{KL}}(P(Xmid Y)parallel P(X))}end{aligned}}}

is the Kullback–Leibler divergence of the product P(X)P(Y){displaystyle P(X)P(Y)}{displaystyle P(X)P(Y)} of the two marginal probability distributions from the joint probability distribution P(X,Y){displaystyle P(X,Y)}{displaystyle P(X,Y)} — i.e. the expected number of extra bits that must be transmitted to identify X{displaystyle X}X and Y{displaystyle Y}Y if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability P(X,Y){displaystyle P(X,Y)}{displaystyle P(X,Y)} is known, it is the expected number of extra bits that must on average be sent to identify Y{displaystyle Y}Y if the value of X{displaystyle X}X is not already known to the receiver.



Shannon entropy


The Shannon entropy,[citation needed]














H(X)={displaystyle mathrm {H} (X)=}{displaystyle mathrm {H} (X)=} E⁡[IX⁡(x)]{displaystyle operatorname {E} [operatorname {I} _{X}(x)]}{displaystyle operatorname {E} [operatorname {I} _{X}(x)]} (s1)
={displaystyle =}= log⁡(N)−DKL(pX(x)∥PU(X)){displaystyle log(N)-D_{text{KL}}(p_{X}(x)parallel P_{U}(X))}{displaystyle log(N)-D_{text{KL}}(p_{X}(x)parallel P_{U}(X))} (s2)


is the number of bits which would have to be transmitted to identify X{displaystyle X}X from N{displaystyle N}N equally likely possibilities, less the Kullback–Leibler divergence of the uniform distribution on the random variates of X{displaystyle X}X, PU(X){displaystyle P_{U}(X)}{displaystyle P_{U}(X)}, from the true distribution P(X){displaystyle P(X)}P(X) — i.e. less the expected number of bits saved, which would have had to be sent if the value of X{displaystyle X}X were coded according to the uniform distribution PU(X){displaystyle P_{U}(X)}{displaystyle P_{U}(X)} rather than the true distribution P(X){displaystyle P(X)}P(X).



Conditional entropy


The conditional entropy,[citation needed]
























H(X∣Y)={displaystyle mathrm {H} (Xmid Y)=}{displaystyle mathrm {H} (Xmid Y)=} log⁡(N)−DKL(P(X,Y)∥PU(X)P(Y)){displaystyle log(N)-D_{text{KL}}(P(X,Y)parallel P_{U}(X)P(Y))}{displaystyle log(N)-D_{text{KL}}(P(X,Y)parallel P_{U}(X)P(Y))}
={displaystyle =}= log⁡(N)−DKL(P(X,Y)∥P(X)P(Y))−DKL(P(X)∥PU(X)){displaystyle log(N)-D_{text{KL}}(P(X,Y)parallel P(X)P(Y))-D_{text{KL}}(P(X)parallel P_{U}(X))}{displaystyle log(N)-D_{text{KL}}(P(X,Y)parallel P(X)P(Y))-D_{text{KL}}(P(X)parallel P_{U}(X))} (c1)
={displaystyle =}= H(X)−I⁡(X;Y){displaystyle mathrm {H} (X)-operatorname {I} (X;Y)}{displaystyle mathrm {H} (X)-operatorname {I} (X;Y)}
={displaystyle =}= log⁡(N)−EY⁡[DKL(P(X∣Y)∥PU(X))]{displaystyle log(N)-operatorname {E} _{Y}{bigl [}D_{text{KL}}(P(Xmid Y)parallel P_{U}(X)){bigr ]}}{displaystyle log(N)-operatorname {E} _{Y}{bigl [}D_{text{KL}}(P(Xmid Y)parallel P_{U}(X)){bigr ]}} (c2)


is the number of bits which would have to be transmitted to identify X{displaystyle X}X from N{displaystyle N}N equally likely possibilities, less the Kullback–Leibler divergence of the product distribution PU(X)P(Y){displaystyle P_{U}(X)P(Y)}{displaystyle P_{U}(X)P(Y)} from the true joint distribution P(X,Y){displaystyle P(X,Y)}{displaystyle P(X,Y)} — i.e. less the expected number of bits saved which would have had to be sent if the value of X{displaystyle X}X were coded according to the uniform distribution PU(X){displaystyle P_{U}(X)}{displaystyle P_{U}(X)} rather than the conditional distribution P(X|Y){displaystyle P(X|Y)}P(X|Y) of X{displaystyle X}X given Y{displaystyle Y}Y.



Cross entropy


The cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p. The cross entropy for two distributions p and q over the same probability space is thus defined as follows:[citation needed]


H(p,q)=Ep⁡[−log⁡(q)]=H(p)+DKL(p∥q).{displaystyle mathrm {H} (p,q)=operatorname {E} _{p}[-log(q)]=mathrm {H} (p)+D_{text{KL}}(pparallel q).}{displaystyle mathrm {H} (p,q)=operatorname {E} _{p}[-log(q)]=mathrm {H} (p)+D_{text{KL}}(pparallel q).}


Bayesian updating


In Bayesian statistics the Kullback–Leibler divergence can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution: p(x)→p(x∣I){displaystyle p(x)to p(xmid I)}{displaystyle p(x)to p(xmid I)}. If some new fact Y=y{displaystyle Y=y}{displaystyle Y=y} is discovered, it can be used to update the posterior distribution for X{displaystyle X}X from p(x∣I){displaystyle p(xmid I)}{displaystyle p(xmid I)} to a new posterior distribution p(x∣y,I){displaystyle p(xmid y,I)}{displaystyle p(xmid y,I)} using Bayes' theorem:


p(x∣y,I)=p(y∣x,I)p(x∣I)p(y∣I){displaystyle p(xmid y,I)={frac {p(ymid x,I)p(xmid I)}{p(ymid I)}}}{displaystyle p(xmid y,I)={frac {p(ymid x,I)p(xmid I)}{p(ymid I)}}}

This distribution has a new entropy:


H(p(−y,I))=−xp(x∣y,I)log⁡p(x∣y,I),{displaystyle mathrm {H} {big (}p(-mid y,I){big )}=-sum _{x}p(xmid y,I)log p(xmid y,I),}{displaystyle mathrm {H} {big (}p(-mid y,I){big )}=-sum _{x}p(xmid y,I)log p(xmid y,I),}

which may be less than or greater than the original entropy H(p(−I)){displaystyle mathrm {H} (p(-mid I))}{displaystyle mathrm {H} (p(-mid I))}. However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on p(x∣I){displaystyle p(xmid I)}{displaystyle p(xmid I)} instead of a new code based on p(x∣y,I){displaystyle p(xmid y,I)}{displaystyle p(xmid y,I)} would have added an expected number of bits:


DKL(p(−y,I)∥p(−I))=∑xp(x∣y,I)log⁡(p(x∣y,I)p(x∣I)){displaystyle D_{text{KL}}{big (}p(-mid y,I)parallel p(-mid I){big )}=sum _{x}p(xmid y,I)log left({frac {p(xmid y,I)}{p(xmid I)}}right)}{displaystyle D_{text{KL}}{big (}p(-mid y,I)parallel p(-mid I){big )}=sum _{x}p(xmid y,I)log left({frac {p(xmid y,I)}{p(xmid I)}}right)}

to the message length. This therefore represents the amount of useful information, or information gain, about X{displaystyle X}X, that we can estimate has been learned by discovering Y=y{displaystyle Y=y}{displaystyle Y=y}.


If a further piece of data, Y2=y2{displaystyle Y_{2}=y_{2}}{displaystyle Y_{2}=y_{2}}, subsequently comes in, the probability distribution for x{displaystyle x}x can be updated further, to give a new best guess p(x∣y1,y2,I){displaystyle p(xmid y_{1},y_{2},I)}{displaystyle p(xmid y_{1},y_{2},I)}. If one reinvestigates the information gain for using p(x∣y1,I){displaystyle p(xmid y_{1},I)}{displaystyle p(xmid y_{1},I)} rather than p(x∣I){displaystyle p(xmid I)}{displaystyle p(xmid I)}, it turns out that it may be either greater or less than previously estimated:



xp(x∣y1,y2,I)log⁡(p(x∣y1,y2,I)p(x∣I)){displaystyle sum _{x}p(xmid y_{1},y_{2},I)log left({frac {p(xmid y_{1},y_{2},I)}{p(xmid I)}}right)}{displaystyle sum _{x}p(xmid y_{1},y_{2},I)log left({frac {p(xmid y_{1},y_{2},I)}{p(xmid I)}}right)} may be ≤ or > than xp(x∣y1,I)log⁡(p(x∣y1,I)p(x∣I)){displaystyle displaystyle sum _{x}p(xmid y_{1},I)log left({frac {p(xmid y_{1},I)}{p(xmid I)}}right)}{displaystyle displaystyle sum _{x}p(xmid y_{1},I)log left({frac {p(xmid y_{1},I)}{p(xmid I)}}right)}

and so the combined information gain does not obey the triangle inequality:



DKL(p(−y1,y2,I)∥p(−I)){displaystyle D_{text{KL}}{big (}p(-mid y_{1},y_{2},I)parallel p(-mid I){big )}}{displaystyle D_{text{KL}}{big (}p(-mid y_{1},y_{2},I)parallel p(-mid I){big )}} may be <, = or > than DKL(p(−y1,y2,I)∥p(−y1,I))+DKL(p(−y1,I)∥p(−I)){displaystyle D_{text{KL}}{big (}p(-mid y_{1},y_{2},I)parallel p(-mid y_{1},I){big )}+D_{text{KL}}{big (}p(-mid y_{1},I)parallel p(-mid I){big )}}{displaystyle D_{text{KL}}{big (}p(-mid y_{1},y_{2},I)parallel p(-mid y_{1},I){big )}+D_{text{KL}}{big (}p(-mid y_{1},I)parallel p(-mid I){big )}}

All one can say is that on average, averaging using p(y2∣y1,x,I){displaystyle p(y_{2}mid y_{1},x,I)}{displaystyle p(y_{2}mid y_{1},x,I)}, the two sides will average out.



Bayesian experimental design


A common goal in Bayesian experimental design is to maximise the expected Kullback–Leibler divergence between the prior and the posterior.[15] When posteriors are approximated to be Gaussian distributions, a design maximising the expected Kullback–Leibler divergence is called Bayes d-optimal.



Discrimination information


The Kullback–Leibler divergence DKL(p(x∣H1)∥p(x∣H0)){textstyle D_{text{KL}}{bigl (}p(xmid H_{1})parallel p(xmid H_{0}){bigr )}}{textstyle D_{text{KL}}{bigl (}p(xmid H_{1})parallel p(xmid H_{0}){bigr )}} can also be interpreted as the expected discrimination information for H1{displaystyle H_{1}}H_{1} over H0{displaystyle H_{0}}H_{0}: the mean information per sample for discriminating in favor of a hypothesis H1{displaystyle H_{1}}H_{1} against a hypothesis H0{displaystyle H_{0}}H_{0}, when hypothesis H1{displaystyle H_{1}}H_{1} is true.[16] Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for H1{displaystyle H_{1}}H_{1} over H0{displaystyle H_{0}}H_{0} to be expected from each sample.


The expected weight of evidence for H1{displaystyle H_{1}}H_{1} over H0{displaystyle H_{0}}H_{0} is not the same as the information gain expected per sample about the probability distribution p(H){displaystyle p(H)}{displaystyle p(H)} of the hypotheses,


DKL(p(x∣H1)∥p(x∣H0))≠IG=DKL(p(H∣x)∥p(H∣I)).{displaystyle D_{text{KL}}(p(xmid H_{1})parallel p(xmid H_{0}))neq IG=D_{text{KL}}(p(Hmid x)parallel p(Hmid I)).}{displaystyle D_{text{KL}}(p(xmid H_{1})parallel p(xmid H_{0}))neq IG=D_{text{KL}}(p(Hmid x)parallel p(Hmid I)).}

Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.


On the entropy scale of information gain there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.



Principle of minimum discrimination information


The idea of Kullback–Leibler divergence as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution f{displaystyle f}f should be chosen which is as hard to discriminate from the original distribution f0{displaystyle f_{0}}f_{0} as possible; so that the new data produces as small an information gain DKL(f∥f0){displaystyle D_{text{KL}}(fparallel f_{0})}{displaystyle D_{text{KL}}(fparallel f_{0})} as possible.


For example, if one had a prior distribution p(x,a){displaystyle p(x,a)}{displaystyle p(x,a)} over x{displaystyle x}x and a{displaystyle a}a, and subsequently learnt the true distribution of a{displaystyle a}a was u(a){displaystyle u(a)}{displaystyle u(a)}, then the Kullback–Leibler divergence between the new joint distribution for x{displaystyle x}x and a{displaystyle a}a, q(x∣a)u(a){displaystyle q(xmid a)u(a)}{displaystyle q(xmid a)u(a)}, and the earlier prior distribution would be:


DKL(q(x∣a)u(a)∥p(x,a))=Eu(a)⁡{DKL(q(x∣a)∥p(x∣a))}+DKL(u(a)∥p(a)),{displaystyle D_{text{KL}}(q(xmid a)u(a)parallel p(x,a))=operatorname {E} _{u(a)}left{D_{text{KL}}(q(xmid a)parallel p(xmid a))right}+D_{text{KL}}(u(a)parallel p(a)),}{displaystyle D_{text{KL}}(q(xmid a)u(a)parallel p(x,a))=operatorname {E} _{u(a)}left{D_{text{KL}}(q(xmid a)parallel p(xmid a))right}+D_{text{KL}}(u(a)parallel p(a)),}

i.e. the sum of the Kullback–Leibler divergence of p(a){displaystyle p(a)}p(a) the prior distribution for a{displaystyle a}a from the updated distribution u(a){displaystyle u(a)}{displaystyle u(a)}, plus the expected value (using the probability distribution u(a){displaystyle u(a)}{displaystyle u(a)}) of the Kullback–Leibler divergence of the prior conditional distribution p(x∣a){displaystyle p(xmid a)}{displaystyle p(xmid a)} from the new conditional distribution q(x∣a{displaystyle q(xmid a}{displaystyle q(xmid a}. (Note that often the later expected value is called the conditional Kullback–Leibler divergence (or conditional relative entropy) and denoted by DKL(q(x∣a)∥p(x∣a)){displaystyle D_{text{KL}}(q(xmid a)parallel p(xmid a))}{displaystyle D_{text{KL}}(q(xmid a)parallel p(xmid a))}[17]:p. 22) This is minimized if q(x∣a)=p(x∣a){displaystyle q(xmid a)=p(xmid a)}{displaystyle q(xmid a)=p(xmid a)} over the whole support of u(a){displaystyle u(a)}{displaystyle u(a)}; and we note that this result incorporates Bayes' theorem, if the new distribution u(a){displaystyle u(a)}{displaystyle u(a)} is in fact a δ function representing certainty that a{displaystyle a}a has one particular value.


MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the Kullback–Leibler divergence continues to be just as relevant.


In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short. Minimising the Kullback–Leibler divergence from m{displaystyle m}m to p{displaystyle p}p with respect to m{displaystyle m}m is equivalent to minimizing the cross-entropy of p{displaystyle p}p and m{displaystyle m}m, since


H(p,m)=H(p)+DKL(p∥m),{displaystyle mathrm {H} (p,m)=mathrm {H} (p)+D_{text{KL}}(pparallel m),}{displaystyle mathrm {H} (p,m)=mathrm {H} (p)+D_{text{KL}}(pparallel m),}

which is appropriate if one is trying to choose an adequate approximation to p{displaystyle p}p. However, this is just as often not the task one is trying to achieve. Instead, just as often it is m{displaystyle m}m that is some fixed prior reference measure, and p{displaystyle p}p that one is attempting to optimise by minimising DKL(p∥m){displaystyle D_{text{KL}}(pparallel m)}{displaystyle D_{text{KL}}(pparallel m)} subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be DKL(p∥m){displaystyle D_{text{KL}}(pparallel m)}{displaystyle D_{text{KL}}(pparallel m)}, rather than H(p,m){displaystyle mathrm {H} (p,m)}{displaystyle mathrm {H} (p,m)}.



Relationship to available work




Pressure versus volume plot of available work from a mole of argon gas relative to ambient, calculated as To{displaystyle T_{o}}T_{o} times the Kullback–Leibler divergence.


Surprisals[18] add where probabilities multiply. The surprisal for an event of probability p{displaystyle p}p is defined as s=kln⁡(1/p){displaystyle s=kln(1/p)}{displaystyle s=kln(1/p)}. If k{displaystyle k}k is {1,1/ln⁡2,1.38×10−23}{displaystyle left{1,1/ln 2,1.38times 10^{-23}right}}{displaystyle left{1,1/ln 2,1.38times 10^{-23}right}} then surprisal is in {{displaystyle {}{nats, bits, or J/K}{displaystyle J/K}}J/K} so that, for instance, there are N{displaystyle N}N bits of surprisal for landing all "heads" on a toss of N{displaystyle N}N coins.


Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the average surprisal S{displaystyle S}S (entropy) for a given set of control parameters (like pressure P{displaystyle P}P or volume V{displaystyle V}V). This constrained entropy maximization, both classically[19] and quantum mechanically,[20] minimizes Gibbs availability in entropy units[21]A≡kln⁡(Z){displaystyle Aequiv -kln(Z)}{displaystyle Aequiv -kln(Z)} where Z{displaystyle Z}Z is a constrained multiplicity or partition function.


When temperature T{displaystyle T}T is fixed, free energy (A{displaystyle Ttimes A}Ttimes A) is also minimized. Thus if T,V{displaystyle T,V}T,V and number of molecules N{displaystyle N}N are constant, the Helmholtz free energy F≡U−TS{displaystyle Fequiv U-TS}{displaystyle Fequiv U-TS} (where U{displaystyle U}U is energy) is minimized as a system "equilibrates." If T{displaystyle T}T and P{displaystyle P}P are held constant (say during processes in your body), the Gibbs free energy G=U+PV−TS{displaystyle G=U+PV-TS}{displaystyle G=U+PV-TS} is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature To{displaystyle T_{o}}T_{o} and pressure Po{displaystyle P_{o}}P_{o} is W=ΔG=NkToΘ(V/Vo){displaystyle W=Delta G=NkT_{o}Theta (V/V_{o})}{displaystyle W=Delta G=NkT_{o}Theta (V/V_{o})} where Vo=NkTo/Po{displaystyle V_{o}=NkT_{o}/P_{o}}V_{o}=NkT_{o}/P_{o} and Θ(x)=x−1−ln⁡x≥0{displaystyle Theta (x)=x-1-ln xgeq 0}{displaystyle Theta (x)=x-1-ln xgeq 0} (see also Gibbs inequality).


More generally[22] the work available relative to some ambient is obtained by multiplying ambient temperature To{displaystyle T_{o}}T_{o} by Kullback–Leibler divergence or net surprisal ΔI≥0,{displaystyle Delta Igeq 0,}{displaystyle Delta Igeq 0,} defined as the average value of kln⁡(p/po){displaystyle kln(p/p_{o})}kln(p/p_{o}) where po{displaystyle p_{o}}p_{o} is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of Vo{displaystyle V_{o}}V_{o} and To{displaystyle T_{o}}T_{o} is thus W=ToΔI{displaystyle W=T_{o}Delta I}{displaystyle W=T_{o}Delta I}, where Kullback–Leibler divergence


ΔI=Nk[Θ(VVo)+32Θ(TTo)].{displaystyle Delta I=Nkleft[Theta left({frac {V}{V_{o}}}right)+{frac {3}{2}}Theta left({frac {T}{T_{o}}}right)right].}{displaystyle Delta I=Nkleft[Theta left({frac {V}{V_{o}}}right)+{frac {3}{2}}Theta left({frac {T}{T_{o}}}right)right].}

The resulting contours of constant Kullback–Leibler divergence, shown at right for a mole of Argon at standard temperature and pressure, for example put limits on the conversion of hot to cold as in flame-powered air-conditioning or in the unpowered device to convert boiling-water to ice-water discussed here.[23] Thus Kullback–Leibler divergence measures thermodynamic availability in bits.



Quantum information theory


For density matrices P{displaystyle P}P and Q{displaystyle Q}Q on a Hilbert space, the K–L divergence (or quantum relative entropy as it is often called in this case) from Q{displaystyle Q}Q to P{displaystyle P}P is defined to be


DKL(P∥Q)=Tr⁡(P(log⁡(P)−log⁡(Q))).{displaystyle D_{text{KL}}(Pparallel Q)=operatorname {Tr} (P(log(P)-log(Q))).}{displaystyle D_{text{KL}}(Pparallel Q)=operatorname {Tr} (P(log(P)-log(Q))).}

In quantum information science the minimum of DKL(P∥Q){displaystyle D_{text{KL}}(Pparallel Q)}{displaystyle D_{text{KL}}(Pparallel Q)} over all separable states Q{displaystyle Q}Q can also be used as a measure of entanglement in the state P{displaystyle P}P.



Relationship between models and reality


Just as Kullback–Leibler divergence of "actual from ambient" measures thermodynamic availability, Kullback–Leibler divergence of "reality from a model" is also useful even if the only clues we have about reality are some experimental measurements. In the former case Kullback–Leibler divergence describes distance to equilibrium or (when multiplied by ambient temperature) the amount of available work, while in the latter case it tells you about surprises that reality has up its sleeve or, in other words, how much the model has yet to learn.


Although this tool for evaluating models against systems that are accessible experimentally may be applied in any field, its application to selecting a statistical model via Akaike information criterion are particularly well described in papers[24] and a book[25] by Burnham and Anderson. In a nutshell the Kullback–Leibler divergence of reality from a model may be estimated, to within a constant additive term, by a function (like the squares summed) of the deviations observed between data and the model's predictions. Estimates of such divergence for models that share the same additive term can in turn be used to select among models.


When trying to fit parametrized models to data there are various estimators which attempt to minimize Kullback–Leibler divergence, such as maximum likelihood and maximum spacing estimators.



Symmetrised divergence


Kullback and Leibler themselves actually defined the divergence as:


DKL(P∥Q)+DKL(Q∥P){displaystyle D_{text{KL}}(Pparallel Q)+D_{text{KL}}(Qparallel P)}{displaystyle D_{text{KL}}(Pparallel Q)+D_{text{KL}}(Qparallel P)}

which is symmetric and nonnegative. This quantity has sometimes been used for feature selection in classification problems, where P{displaystyle P}P and Q{displaystyle Q}Q are the conditional pdfs of a feature under two different classes.


An alternative is given via the λ{displaystyle lambda }lambda divergence,


(P∥Q)=λDKL(P∥λP+(1−λ)Q)+(1−λ)DKL(Q∥λP+(1−λ)Q),{displaystyle D_{lambda }(Pparallel Q)=lambda D_{text{KL}}(Pparallel lambda P+(1-lambda )Q)+(1-lambda )D_{text{KL}}(Qparallel lambda P+(1-lambda )Q),}{displaystyle D_{lambda }(Pparallel Q)=lambda D_{text{KL}}(Pparallel lambda P+(1-lambda )Q)+(1-lambda )D_{text{KL}}(Qparallel lambda P+(1-lambda )Q),}

which can be interpreted as the expected information gain about X{displaystyle X}X from discovering which probability distribution X{displaystyle X}X is drawn from, P{displaystyle P}P or Q{displaystyle Q}Q, if they currently have probabilities λ{displaystyle lambda }lambda and 1−λ{displaystyle 1-lambda }1-lambda respectively.[clarification needed][citation needed]


The value λ=0.5{displaystyle lambda =0.5}{displaystyle lambda =0.5} gives the Jensen–Shannon divergence, defined by


DJS=12DKL(P∥M)+12DKL(Q∥M){displaystyle D_{text{JS}}={frac {1}{2}}D_{text{KL}}(Pparallel M)+{frac {1}{2}}D_{text{KL}}(Qparallel M)}{displaystyle D_{text{JS}}={frac {1}{2}}D_{text{KL}}(Pparallel M)+{frac {1}{2}}D_{text{KL}}(Qparallel M)}

where M{displaystyle M}M is the average of the two distributions,


M=12(P+Q).{displaystyle M={frac {1}{2}}(P+Q).}{displaystyle M={frac {1}{2}}(P+Q).}

DJS{displaystyle D_{JS}}{displaystyle D_{JS}} can also be interpreted as the capacity of a noisy information channel with two inputs giving the output distributions P{displaystyle P}P and Q{displaystyle Q}Q. The Jensen–Shannon divergence, like all f-divergences, is locally proportional to the Fisher information metric. It is similar to the Hellinger metric (in the sense that induces the same affine connection on a statistical manifold).



Relationship to other probability-distance measures


There are many other important measures of probability distance. Some of these are particularly connected with the Kullback–Leibler divergence. For example:



  • The total variation distance, δ(p,q){displaystyle delta (p,q)}delta (p,q). This is connected to the divergence through Pinsker's inequality: δ(P,Q)≤12DKL(P∥Q){displaystyle delta (P,Q)leq {sqrt {{frac {1}{2}}D_{text{KL}}(Pparallel Q)}}}{displaystyle delta (P,Q)leq {sqrt {{frac {1}{2}}D_{text{KL}}(Pparallel Q)}}}

  • The family of Rényi divergences provide generalizations of the Kullback–Leibler divergence. Depending on the value of a certain parameter, α{displaystyle alpha }alpha , various inequalities may be deduced.


Other notable measures of distance include the Hellinger distance, histogram intersection, Chi-squared statistic, quadratic form distance, match distance, Kolmogorov–Smirnov distance, and earth mover's distance.[26]



Data differencing



Just as absolute entropy serves as theoretical background for data compression, relative entropy serves as theoretical background for data differencing – the absolute entropy of a set of data in this sense being the data required to reconstruct it (minimum compressed size), while the relative entropy of a target set of data, given a source set of data, is the data required to reconstruct the target given the source (minimum size of a patch).



See also




  • Akaike information criterion

  • Bayesian information criterion

  • Bregman divergence

  • Cross-entropy

  • Deviance information criterion

  • Entropic value at risk

  • Entropy power inequality

  • Information gain in decision trees

  • Information gain ratio

  • Information theory and measure theory

  • Jensen–Shannon divergence

  • Quantum relative entropy


  • Solomon Kullback and Richard Leibler




References





  1. ^ Kullback, S.; Leibler, R.A. (1951). "On information and sufficiency". Annals of Mathematical Statistics. 22 (1): 79–86. doi:10.1214/aoms/1177729694. MR 0039968..mw-parser-output cite.citation{font-style:inherit}.mw-parser-output q{quotes:"""""""'""'"}.mw-parser-output code.cs1-code{color:inherit;background:inherit;border:inherit;padding:inherit}.mw-parser-output .cs1-lock-free a{background:url("//upload.wikimedia.org/wikipedia/commons/thumb/6/65/Lock-green.svg/9px-Lock-green.svg.png")no-repeat;background-position:right .1em center}.mw-parser-output .cs1-lock-limited a,.mw-parser-output .cs1-lock-registration a{background:url("//upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Lock-gray-alt-2.svg/9px-Lock-gray-alt-2.svg.png")no-repeat;background-position:right .1em center}.mw-parser-output .cs1-lock-subscription a{background:url("//upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Lock-red-alt-2.svg/9px-Lock-red-alt-2.svg.png")no-repeat;background-position:right .1em center}.mw-parser-output .cs1-subscription,.mw-parser-output .cs1-registration{color:#555}.mw-parser-output .cs1-subscription span,.mw-parser-output .cs1-registration span{border-bottom:1px dotted;cursor:help}.mw-parser-output .cs1-hidden-error{display:none;font-size:100%}.mw-parser-output .cs1-visible-error{font-size:100%}.mw-parser-output .cs1-subscription,.mw-parser-output .cs1-registration,.mw-parser-output .cs1-format{font-size:95%}.mw-parser-output .cs1-kern-left,.mw-parser-output .cs1-kern-wl-left{padding-left:0.2em}.mw-parser-output .cs1-kern-right,.mw-parser-output .cs1-kern-wl-right{padding-right:0.2em}


  2. ^ abc Kullback, S. (1959), Information Theory and Statistics, John Wiley & Sons. Republished by Dover Publications in 1968; reprinted in 1978:
    ISBN 0-8446-5625-9.



  3. ^ Kullback, S. (1987). "Letter to the Editor: The Kullback–Leibler distance". The American Statistician. 41 (4): 340–341. doi:10.1080/00031305.1987.10475510. JSTOR 2684769.


  4. ^ MacKay, David J.C. (2003). Information Theory, Inference, and Learning Algorithms (First ed.). Cambridge University Press. p. 34.


  5. ^ Bishop C. (2006). Pattern Recognition and Machine Learning


  6. ^ Burnham, K. P.; Anderson, D. R. (2002). Model Selection and Multi-Model Inference (2nd ed.). Springer. p. 51.


  7. ^ Hobson, Arthur (1971). Concepts in statistical mechanics. New York: Gordon and Breach. ISBN 0677032404.


  8. ^ Baez, John; Fritz, Tobias (2014). "A Bayesian characterization of relative entropy". Theory and Application of Categories. 29: 421–456. arXiv:1402.3067.


  9. ^ Sanov, I.N. (1957). "On the probability of large deviations of random magnitudes". Matem. Sbornik. 42 (84): 11–44.


  10. ^ Novak S.Y. (2011), Extreme Value Methods with Applications to Finance ch. 14.5 (Chapman & Hall).
    ISBN 978-1-4398-3574-6.



  11. ^ See the section "differential entropy – 4" in Relative Entropy video lecture by Sergio Verdú NIPS 2009


  12. ^ Duchi J., "Derivations for Linear Algebra and Optimization".


  13. ^ Rényi A. (1970). Probability Theory. Elsevier. Appendix, Sec.4. ISBN 0-486-45867-9.


  14. ^ Rényi, A. (1961), "On measures of entropy and information" (PDF), Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability 1960, pp. 547–561


  15. ^ Chaloner, K.; Verdinelli, I. (1995). "Bayesian experimental design: a review". Statistical Science. 10 (3): 273–304. doi:10.1214/ss/1177009939.


  16. ^ Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. (2007). "Section 14.7.2. Kullback–Leibler Distance". Numerical Recipes: The Art of Scientific Computing (3rd ed.). Cambridge University Press. ISBN 978-0-521-88068-8


  17. ^ Thomas M. Cover, Joy A. Thomas (1991) Elements of Information Theory (John Wiley & Sons)


  18. ^ Myron Tribus (1961), Thermodynamics and Thermostatics (D. Van Nostrand, New York)


  19. ^ Jaynes, E. T. (1957). "Information theory and statistical mechanics" (PDF). Physical Review. 106: 620–630. Bibcode:1957PhRv..106..620J. doi:10.1103/physrev.106.620.


  20. ^ Jaynes, E. T. (1957). "Information theory and statistical mechanics II" (PDF). Physical Review. 108: 171–190. Bibcode:1957PhRv..108..171J. doi:10.1103/physrev.108.171.


  21. ^ J.W. Gibbs (1873), "A method of geometrical representation of thermodynamic properties of substances by means of surfaces", reprinted in The Collected Works of J. W. Gibbs, Volume I Thermodynamics, ed. W. R. Longley and R. G. Van Name (New York: Longmans, Green, 1931) footnote page 52.


  22. ^ Tribus, M.; McIrvine, E. C. (1971). "Energy and information". Scientific American. 224: 179–186. Bibcode:1971SciAm.225c.179T. doi:10.1038/scientificamerican0971-179.


  23. ^ Fraundorf, P. (2007). "Thermal roots of correlation-based complexity". Complexity. 13 (3): 18–26. arXiv:1103.2481. Bibcode:2008Cmplx..13c..18F. doi:10.1002/cplx.20195.


  24. ^ Burnham, K.P.; Anderson, D.R. (2001). "Kullback–Leibler information as a basis for strong inference in ecological studies". Wildlife Research. 28: 111–119. doi:10.1071/WR99107.


  25. ^ Burnham, K. P. and Anderson D. R. (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second Edition (Springer Science)
    ISBN 978-0-387-95364-9.



  26. ^ Rubner, Y.; Tomasi, C.; Guibas, L. J. (2000). "The earth mover's distance as a metric for image retrieval". International Journal of Computer Vision. 40 (2): 99–121.




External links



  • Information Theoretical Estimators Toolbox

  • Ruby gem for calculating Kullback–Leibler divergence

  • Jon Shlens' tutorial on Kullback–Leibler divergence and likelihood theory

  • Matlab code for calculating Kullback–Leibler divergence for discrete distributions


  • Sergio Verdú, Relative Entropy, NIPS 2009. One-hour video lecture.

  • A modern summary of info-theoretic divergence measures




Comments

Popular posts from this blog

Information security

章鱼与海女图

Farm Security Administration