Variational Bayesian methods









Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As is typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:



  1. To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.

  2. To derive a lower bound for the marginal likelihood (sometimes called the "evidence") of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the Bayes factor article.)


In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods — particularly, Markov chain Monte Carlo methods such as Gibbs sampling — for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to directly evaluate or sample from. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, Variational Bayes provides a locally-optimal, exact analytical solution to an approximation of the posterior.


Variational Bayes can be seen as an extension of the EM (expectation-maximization) algorithm from maximum a posteriori estimation (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.


For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to iteratively update the parameters often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.




Contents






  • 1 Mathematical derivation


  • 2 Mean field approximation


  • 3 A basic example


    • 3.1 The mathematical model


    • 3.2 The joint probability


    • 3.3 Factorized approximation


    • 3.4 Derivation of q(μ)


    • 3.5 Derivation of q(τ)


    • 3.6 Algorithm for computing the parameters




  • 4 Further discussion


    • 4.1 Step-by-step recipe


    • 4.2 Most important points


    • 4.3 Compared with expectation maximization (EM)




  • 5 A more complex example


  • 6 Exponential-family distributions


  • 7 See also


  • 8 Notes


  • 9 References


  • 10 External links





Mathematical derivation


In variational inference, the posterior distribution over a set of unobserved variables Z = {Z_1, …, Z_n} given some data X is approximated by a variational distribution, Q(Z):

P(\mathbf{Z} \mid \mathbf{X}) \approx Q(\mathbf{Z}).

The distribution Q(Z) is restricted to belong to a family of distributions of simpler form than P(Z ∣ X), selected with the intention of making Q(Z) similar to the true posterior, P(Z ∣ X). The lack of similarity is measured in terms of a dissimilarity function d(Q; P), and hence inference is performed by selecting the distribution Q(Z) that minimizes d(Q; P).


The most common type of variational Bayes uses the Kullback–Leibler divergence (KL-divergence) of P from Q as the choice of dissimilarity function. This choice makes this minimization tractable. The KL-divergence is defined as


D_{\mathrm{KL}}(Q \parallel P) \triangleq \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z} \mid \mathbf{X})}.

Note that Q and P are reversed from what one might expect. This use of reversed KL-divergence is conceptually similar to the expectation-maximization algorithm. (Using the KL-divergence in the other way produces the expectation propagation algorithm.)
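
To make the direction of the divergence concrete, here is a minimal numerical sketch (the two discrete distributions are arbitrary examples chosen for illustration, not part of the standard presentation) showing that the two orderings of the arguments generally give different values:

    import numpy as np

    def kl_divergence(q, p):
        """D_KL(Q || P) = sum_z Q(z) * log(Q(z) / P(z)) for discrete distributions."""
        q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
        return float(np.sum(q * np.log(q / p)))

    # Two example distributions over three outcomes (arbitrary numbers).
    q = np.array([0.6, 0.3, 0.1])
    p = np.array([0.4, 0.4, 0.2])

    print(kl_divergence(q, p))  # D_KL(Q || P): the ordering used in variational Bayes
    print(kl_divergence(p, q))  # D_KL(P || Q): a different value, since the divergence is asymmetric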


Variational techniques are typically used to form an approximation for:


P(\mathbf{Z} \mid \mathbf{X}) = \frac{P(\mathbf{X} \mid \mathbf{Z})\, P(\mathbf{Z})}{P(\mathbf{X})} = \frac{P(\mathbf{X} \mid \mathbf{Z})\, P(\mathbf{Z})}{\int_{\mathbf{Z}} P(\mathbf{X}, \mathbf{Z}) \, d\mathbf{Z}}

The marginalization over Z required to calculate P(X) in the denominator is typically intractable, for example because the search space of Z is combinatorially large. Therefore, we seek an approximation, using Q(Z) ≈ P(Z ∣ X).


The KL-divergence can be written as


D_{\mathrm{KL}}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \left[ \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} + \log P(\mathbf{X}) \right],

or


\log P(\mathbf{X}) = D_{\mathrm{KL}}(Q \parallel P) - \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} = D_{\mathrm{KL}}(Q \parallel P) + \mathcal{L}(Q).

As the log-evidence log P(X) is fixed with respect to Q, maximizing the final term L(Q) minimizes the KL divergence of P from Q. By appropriate choice of Q, L(Q) becomes tractable to compute and to maximize. Hence we have both an analytical approximation Q for the posterior P(Z ∣ X), and a lower bound L(Q) for the log-evidence log P(X). The lower bound L(Q) is known as the (negative) variational free energy, in analogy with thermodynamic free energy, because it can also be expressed as an "energy" E_Q[log P(Z, X)] plus the entropy of Q.
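
This decomposition can be checked numerically on a tiny discrete model by brute-force enumeration. The following sketch assumes a made-up joint table p_joint over three latent states and two observed values, and an arbitrary trial distribution q; it verifies that log P(X) equals D_KL(Q‖P) plus L(Q), and that L(Q) never exceeds the log-evidence:

    import numpy as np

    # Made-up joint distribution p(Z, X) over 3 latent states and 2 observed values.
    p_joint = np.array([[0.10, 0.15],
                        [0.20, 0.05],
                        [0.30, 0.20]])

    x = 0                                  # the observed value of X
    p_x = p_joint[:, x].sum()              # evidence P(X = x)
    p_post = p_joint[:, x] / p_x           # exact posterior P(Z | X = x)

    q = np.array([0.5, 0.2, 0.3])          # an arbitrary variational distribution Q(Z)

    kl = np.sum(q * np.log(q / p_post))            # D_KL(Q || P(Z | X))
    elbo = np.sum(q * np.log(p_joint[:, x] / q))   # L(Q) = E_Q[log P(Z, X)] + entropy of Q

    print(np.log(p_x), kl + elbo)   # equal up to rounding: log P(X) = D_KL(Q||P) + L(Q)
    print(elbo <= np.log(p_x))      # True: L(Q) is a lower bound on the log-evidence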


By the generalized Pythagorean theorem for Bregman divergences, of which the KL-divergence is a special case, it can be shown that:[1][2]




Generalized Pythagorean theorem for Bregman divergence [2].


D_{\mathrm{KL}}(Q \parallel P) \geq D_{\mathrm{KL}}(Q \parallel Q^{*}) + D_{\mathrm{KL}}(Q^{*} \parallel P), \quad \forall Q \in \mathcal{C}

where C is a convex set and the equality holds if:


Q = Q^{*} \triangleq \arg\min_{Q \in \mathcal{C}} D_{\mathrm{KL}}(Q \parallel P).

In this case, the global minimizer Q*(Z) = q*(Z_1 | Z_2) q*(Z_2) = q*(Z_2 | Z_1) q*(Z_1), with Z = {Z_1, Z_2}, can be found as follows:[1]


q^{*}(\mathbf{Z}_2) = \frac{P(\mathbf{X})}{\zeta(\mathbf{X})} \frac{P(\mathbf{Z}_2 \mid \mathbf{X})}{\exp\left(D_{\mathrm{KL}}(q^{*}(\mathbf{Z}_1 \mid \mathbf{Z}_2) \parallel P(\mathbf{Z}_1 \mid \mathbf{Z}_2, \mathbf{X}))\right)} = \frac{1}{\zeta(\mathbf{X})} \exp \mathbb{E}_{q^{*}(\mathbf{Z}_1 \mid \mathbf{Z}_2)}\left( \log \frac{P(\mathbf{Z}, \mathbf{X})}{q^{*}(\mathbf{Z}_1 \mid \mathbf{Z}_2)} \right),

in which the normalizing constant is:


\zeta(\mathbf{X}) = P(\mathbf{X}) \int_{\mathbf{Z}_2} \frac{P(\mathbf{Z}_2 \mid \mathbf{X})}{\exp\left(D_{\mathrm{KL}}(q^{*}(\mathbf{Z}_1 \mid \mathbf{Z}_2) \parallel P(\mathbf{Z}_1 \mid \mathbf{Z}_2, \mathbf{X}))\right)} = \int_{\mathbf{Z}_2} \exp \mathbb{E}_{q^{*}(\mathbf{Z}_1 \mid \mathbf{Z}_2)}\left( \log \frac{P(\mathbf{Z}, \mathbf{X})}{q^{*}(\mathbf{Z}_1 \mid \mathbf{Z}_2)} \right).

The term ζ(X) is often called the evidence lower bound (ELBO) in practice, since P(X) ≥ ζ(X) = exp(L(Q*)),[1] as shown above.


By interchanging the roles of Z_1 and Z_2, we can iteratively compute the approximations q*(Z_1) and q*(Z_2) of the true model's marginals P(Z_1 | X) and P(Z_2 | X), respectively. Although this iterative scheme is guaranteed to converge monotonically,[1] the converged Q* is only a local minimizer of D_KL(Q||P).


If the constrained space C is confined to the space of independent distributions, i.e. q*(Z_1 | Z_2) = q*(Z_1), the above iterative scheme becomes the so-called mean field approximation Q*(Z) = q*(Z_1) q*(Z_2), as shown below.



Mean field approximation


The variational distribution Q(Z) is usually assumed to factorize over some partition of the latent variables, i.e. for some partition of the latent variables Z into Z_1, …, Z_M,


Q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(\mathbf{Z}_i \mid \mathbf{X})

It can be shown using the calculus of variations (hence the name "variational Bayes") that the "best" distribution q_j* for each of the factors q_j (in terms of the distribution minimizing the KL divergence, as described above) can be expressed as:


q_j^{*}(\mathbf{Z}_j \mid \mathbf{X}) = \frac{e^{\operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})]}}{\int e^{\operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})]} \, d\mathbf{Z}_j}

where E_{i≠j}[ln p(Z, X)] is the expectation of the logarithm of the joint probability of the data and latent variables, taken over all variables not in the partition.


In practice, we usually work in terms of logarithms, i.e.:


\ln q_j^{*}(\mathbf{Z}_j \mid \mathbf{X}) = \operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}

The constant in the above expression is related to the normalizing constant (the denominator in the expression above for q_j*) and is usually reinstated by inspection, as the rest of the expression can usually be recognized as being a known type of distribution (e.g. Gaussian, gamma, etc.).


Using the properties of expectations, the expression E_{i≠j}[ln p(Z, X)] can usually be simplified into a function of the fixed hyperparameters of the prior distributions over the latent variables and of expectations (and sometimes higher moments such as the variance) of latent variables not in the current partition (i.e. latent variables not included in Z_j). This creates circular dependencies between the parameters of the distributions over variables in one partition and the expectations of variables in the other partitions. This naturally suggests an iterative algorithm, much like EM (the expectation-maximization algorithm), in which the expectations (and possibly higher moments) of the latent variables are initialized in some fashion (perhaps randomly), and then the parameters of each distribution are computed in turn using the current values of the expectations, after which the expectation of the newly computed distribution is set appropriately according to the computed parameters. An algorithm of this sort is guaranteed to converge.[3]


In other words, for each of the partitions of variables, by simplifying the expression for the distribution over the partition's variables and examining the distribution's functional dependency on the variables in question, the family of the distribution can usually be determined (which in turn determines the value of the constant). The formula for the distribution's parameters will be expressed in terms of the prior distributions' hyperparameters (which are known constants), but also in terms of expectations of functions of variables in other partitions. Usually these expectations can be simplified into functions of expectations of the variables themselves (i.e. the means); sometimes expectations of squared variables (which can be related to the variance of the variables), or expectations of higher powers (i.e. higher moments) also appear. In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up. However, those formulas depend on those distributions' parameters, which depend in turn on the expectations about other variables. The result is that the formulas for the parameters of each variable's distributions can be expressed as a series of equations with mutual, nonlinear dependencies among the variables. Usually, it is not possible to solve this system of equations directly. However, as described above, the dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge. An example will make this process clearer.



A basic example


Consider a simple non-hierarchical Bayesian model consisting of a set of i.i.d. observations from a Gaussian distribution, with unknown mean and variance.[4] In the following, we work through this model in great detail to illustrate the workings of the variational Bayes method.


For mathematical convenience, in the following example we work in terms of the precision — i.e. the reciprocal of the variance (or in a multivariate Gaussian, the inverse of the covariance matrix) — rather than the variance itself. (From a theoretical standpoint, precision and variance are equivalent since there is a one-to-one correspondence between the two.)



The mathematical model


We place conjugate prior distributions on the unknown mean and precision, i.e. the mean also follows a Gaussian distribution while the precision follows a gamma distribution. In other words:


\begin{aligned}
\tau &\sim \operatorname{Gamma}(a_0, b_0) \\
\mu &\sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1}) \\
\{x_1, \dots, x_N\} &\sim \mathcal{N}(\mu, \tau^{-1}) \\
N &= \text{number of data points}
\end{aligned}

We are given N data points X = {x_1, …, x_N} and our goal is to infer the posterior distribution q(μ, τ) = p(μ, τ ∣ x_1, …, x_N) of the parameters μ and τ.


The hyperparameters μ_0, λ_0, a_0 and b_0 are fixed, given values. They can be set to small positive numbers to give broad prior distributions, indicating ignorance about μ and τ.
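
As a concrete illustration, the following sketch draws a synthetic data set from this generative model; the particular hyperparameter values and sample size are arbitrary choices for the example, not values prescribed by the method:

    import numpy as np

    rng = np.random.default_rng(0)

    # Arbitrary hyperparameters (small a_0 and b_0 give a broad prior on tau) and sample size.
    mu_0, lambda_0, a_0, b_0 = 0.0, 1.0, 0.1, 0.1
    N = 100

    tau = rng.gamma(shape=a_0, scale=1.0 / b_0)            # tau ~ Gamma(a_0, b_0), with b_0 a rate parameter
    mu = rng.normal(mu_0, 1.0 / np.sqrt(lambda_0 * tau))   # mu | tau ~ N(mu_0, (lambda_0 tau)^-1)
    x = rng.normal(mu, 1.0 / np.sqrt(tau), size=N)         # x_n | mu, tau ~ N(mu, tau^-1)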



The joint probability


The joint probability of all variables can be rewritten as


p(\mathbf{X}, \mu, \tau) = p(\mathbf{X} \mid \mu, \tau)\, p(\mu \mid \tau)\, p(\tau)

where the individual factors are


\begin{aligned}
p(\mathbf{X} \mid \mu, \tau) &= \prod_{n=1}^N \mathcal{N}(x_n \mid \mu, \tau^{-1}) \\
p(\mu \mid \tau) &= \mathcal{N}\left(\mu \mid \mu_0, (\lambda_0\tau)^{-1}\right) \\
p(\tau) &= \operatorname{Gamma}(\tau \mid a_0, b_0)
\end{aligned}

where


\begin{aligned}
\mathcal{N}(x \mid \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{-(x-\mu)^2}{2\sigma^2}} \\
\operatorname{Gamma}(\tau \mid a, b) &= \frac{1}{\Gamma(a)} b^a \tau^{a-1} e^{-b\tau}
\end{aligned}


Factorized approximation


Assume that q(μ, τ) = q(μ) q(τ), i.e. that the posterior distribution factorizes into independent factors for μ and τ. This type of assumption underlies the variational Bayesian method. The true posterior distribution does not in fact factor this way (in fact, in this simple case, it is known to be a Gaussian-gamma distribution), and hence the result we obtain will be an approximation.



Derivation of q(μ)


Then


\begin{aligned}
\ln q_\mu^*(\mu) &= \operatorname{E}_{\tau}\left[\ln p(\mathbf{X} \mid \mu, \tau) + \ln p(\mu \mid \tau) + \ln p(\tau)\right] + C \\
&= \operatorname{E}_{\tau}\left[\ln p(\mathbf{X} \mid \mu, \tau)\right] + \operatorname{E}_{\tau}\left[\ln p(\mu \mid \tau)\right] + \operatorname{E}_{\tau}\left[\ln p(\tau)\right] + C \\
&= \operatorname{E}_{\tau}\left[\ln \prod_{n=1}^N \mathcal{N}\left(x_n \mid \mu, \tau^{-1}\right)\right] + \operatorname{E}_{\tau}\left[\ln \mathcal{N}\left(\mu \mid \mu_0, (\lambda_0\tau)^{-1}\right)\right] + C_2 \\
&= \operatorname{E}_{\tau}\left[\ln \prod_{n=1}^N \sqrt{\frac{\tau}{2\pi}} e^{-\frac{(x_n-\mu)^2\tau}{2}}\right] + \operatorname{E}_{\tau}\left[\ln \sqrt{\frac{\lambda_0\tau}{2\pi}} e^{-\frac{(\mu-\mu_0)^2\lambda_0\tau}{2}}\right] + C_2 \\
&= \operatorname{E}_{\tau}\left[\sum_{n=1}^N \left(\frac{1}{2}(\ln\tau - \ln 2\pi) - \frac{(x_n-\mu)^2\tau}{2}\right)\right] + \operatorname{E}_{\tau}\left[\frac{1}{2}(\ln\lambda_0 + \ln\tau - \ln 2\pi) - \frac{(\mu-\mu_0)^2\lambda_0\tau}{2}\right] + C_2 \\
&= \operatorname{E}_{\tau}\left[\sum_{n=1}^N -\frac{(x_n-\mu)^2\tau}{2}\right] + \operatorname{E}_{\tau}\left[-\frac{(\mu-\mu_0)^2\lambda_0\tau}{2}\right] + \operatorname{E}_{\tau}\left[\sum_{n=1}^N \frac{1}{2}(\ln\tau - \ln 2\pi)\right] + \operatorname{E}_{\tau}\left[\frac{1}{2}(\ln\lambda_0 + \ln\tau - \ln 2\pi)\right] + C_2 \\
&= \operatorname{E}_{\tau}\left[\sum_{n=1}^N -\frac{(x_n-\mu)^2\tau}{2}\right] + \operatorname{E}_{\tau}\left[-\frac{(\mu-\mu_0)^2\lambda_0\tau}{2}\right] + C_3 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ \sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2 \right\} + C_3
\end{aligned}

In the above derivation, C, C_2 and C_3 refer to values that are constant with respect to μ. Note that the term E_τ[ln p(τ)] is not a function of μ and will have the same value regardless of the value of μ. Hence in line 3 we can absorb it into the constant term at the end. We do the same thing in line 7.


The last line is simply a quadratic polynomial in μ. Since this is the logarithm of q_μ*(μ), we can see that q_μ*(μ) itself is a Gaussian distribution.


With a certain amount of tedious math (expanding the squares inside of the braces, separating out and grouping the terms involving μ and μ², and completing the square over μ), we can derive the parameters of the Gaussian distribution:


\begin{aligned}
\ln q_\mu^*(\mu) &= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ \sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2 \right\} + C_3 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ \sum_{n=1}^N (x_n^2 - 2x_n\mu + \mu^2) + \lambda_0(\mu^2 - 2\mu_0\mu + \mu_0^2) \right\} + C_3 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ \left(\sum_{n=1}^N x_n^2\right) - 2\left(\sum_{n=1}^N x_n\right)\mu + \left(\sum_{n=1}^N \mu^2\right) + \lambda_0\mu^2 - 2\lambda_0\mu_0\mu + \lambda_0\mu_0^2 \right\} + C_3 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\mu^2 - 2\left(\lambda_0\mu_0 + \sum_{n=1}^N x_n\right)\mu + \left(\sum_{n=1}^N x_n^2\right) + \lambda_0\mu_0^2 \right\} + C_3 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\mu^2 - 2\left(\lambda_0\mu_0 + \sum_{n=1}^N x_n\right)\mu \right\} + C_4 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\mu^2 - 2\left(\frac{\lambda_0\mu_0 + \sum_{n=1}^N x_n}{\lambda_0+N}\right)(\lambda_0+N)\mu \right\} + C_4 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\left(\mu^2 - 2\left(\frac{\lambda_0\mu_0 + \sum_{n=1}^N x_n}{\lambda_0+N}\right)\mu\right) \right\} + C_4 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\left(\mu^2 - 2\left(\frac{\lambda_0\mu_0 + \sum_{n=1}^N x_n}{\lambda_0+N}\right)\mu + \left(\frac{\lambda_0\mu_0 + \sum_{n=1}^N x_n}{\lambda_0+N}\right)^2 - \left(\frac{\lambda_0\mu_0 + \sum_{n=1}^N x_n}{\lambda_0+N}\right)^2\right) \right\} + C_4 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\left(\mu^2 - 2\left(\frac{\lambda_0\mu_0 + \sum_{n=1}^N x_n}{\lambda_0+N}\right)\mu + \left(\frac{\lambda_0\mu_0 + \sum_{n=1}^N x_n}{\lambda_0+N}\right)^2\right) \right\} + C_5 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\left(\mu - \frac{\lambda_0\mu_0 + \sum_{n=1}^N x_n}{\lambda_0+N}\right)^2 \right\} + C_5 \\
&= -\frac{1}{2}(\lambda_0+N)\operatorname{E}_{\tau}[\tau]\left(\mu - \frac{\lambda_0\mu_0 + \sum_{n=1}^N x_n}{\lambda_0+N}\right)^2 + C_5
\end{aligned}

Note that all of the above steps can be shortened by using the formula for the sum of two quadratics.
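
For reference, the identity in question is the standard algebraic fact that, for positive coefficients a and c,

a(x-b)^2 + c(x-d)^2 = (a+c)\left(x - \frac{ab+cd}{a+c}\right)^2 + \frac{ac}{a+c}(b-d)^2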


In other words:


\begin{aligned}
q_\mu^*(\mu) &\sim \mathcal{N}(\mu \mid \mu_N, \lambda_N^{-1}) \\
\mu_N &= \frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0 + N} \\
\lambda_N &= (\lambda_0 + N)\operatorname{E}_{\tau}[\tau] \\
\bar{x} &= \frac{1}{N}\sum_{n=1}^N x_n
\end{aligned}


Derivation of q(τ)


The derivation of q_τ*(τ) is similar to the above, although we omit some of the details for the sake of brevity.


\begin{aligned}
\ln q_\tau^*(\tau) &= \operatorname{E}_{\mu}[\ln p(\mathbf{X} \mid \mu, \tau) + \ln p(\mu \mid \tau)] + \ln p(\tau) + \text{constant} \\
&= (a_0 - 1)\ln\tau - b_0\tau + \frac{1}{2}\ln\tau + \frac{N}{2}\ln\tau - \frac{\tau}{2}\operatorname{E}_{\mu}\left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right] + \text{constant}
\end{aligned}

Exponentiating both sides, we can see that q_τ*(τ) is a gamma distribution. Specifically:


\begin{aligned}
q_\tau^*(\tau) &\sim \operatorname{Gamma}(\tau \mid a_N, b_N) \\
a_N &= a_0 + \frac{N+1}{2} \\
b_N &= b_0 + \frac{1}{2}\operatorname{E}_{\mu}\left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right]
\end{aligned}


Algorithm for computing the parameters


Let us recap the conclusions from the previous sections:


\begin{aligned}
q_\mu^*(\mu) &\sim \mathcal{N}(\mu \mid \mu_N, \lambda_N^{-1}) \\
\mu_N &= \frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0 + N} \\
\lambda_N &= (\lambda_0 + N)\operatorname{E}_{\tau}[\tau] \\
\bar{x} &= \frac{1}{N}\sum_{n=1}^N x_n
\end{aligned}

and


\begin{aligned}
q_\tau^*(\tau) &\sim \operatorname{Gamma}(\tau \mid a_N, b_N) \\
a_N &= a_0 + \frac{N+1}{2} \\
b_N &= b_0 + \frac{1}{2}\operatorname{E}_{\mu}\left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right]
\end{aligned}

In each case, the parameters for the distribution over one of the variables depend on expectations taken with respect to the other variable. We can expand the expectations, using the standard formulas for the expectations of moments of the Gaussian and gamma distributions:


\begin{aligned}
\operatorname{E}[\tau \mid a_N, b_N] &= \frac{a_N}{b_N} \\
\operatorname{E}\left[\mu \mid \mu_N, \lambda_N^{-1}\right] &= \mu_N \\
\operatorname{E}\left[X^2\right] &= \operatorname{Var}(X) + (\operatorname{E}[X])^2 \\
\operatorname{E}\left[\mu^2 \mid \mu_N, \lambda_N^{-1}\right] &= \lambda_N^{-1} + \mu_N^2
\end{aligned}

Applying these formulas to the above equations is trivial in most cases, but the equation for b_N takes more work:


\begin{aligned}
b_N &= b_0 + \frac{1}{2}\operatorname{E}_{\mu}\left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right] \\
&= b_0 + \frac{1}{2}\operatorname{E}_{\mu}\left[(\lambda_0+N)\mu^2 - 2\left(\lambda_0\mu_0 + \sum_{n=1}^N x_n\right)\mu + \left(\sum_{n=1}^N x_n^2\right) + \lambda_0\mu_0^2\right] \\
&= b_0 + \frac{1}{2}\left[(\lambda_0+N)\operatorname{E}_{\mu}[\mu^2] - 2\left(\lambda_0\mu_0 + \sum_{n=1}^N x_n\right)\operatorname{E}_{\mu}[\mu] + \left(\sum_{n=1}^N x_n^2\right) + \lambda_0\mu_0^2\right] \\
&= b_0 + \frac{1}{2}\left[(\lambda_0+N)\left(\lambda_N^{-1} + \mu_N^2\right) - 2\left(\lambda_0\mu_0 + \sum_{n=1}^N x_n\right)\mu_N + \left(\sum_{n=1}^N x_n^2\right) + \lambda_0\mu_0^2\right]
\end{aligned}

We can then write the parameter equations as follows, without any expectations:


\begin{aligned}
\mu_N &= \frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0 + N} \\
\lambda_N &= (\lambda_0 + N)\frac{a_N}{b_N} \\
\bar{x} &= \frac{1}{N}\sum_{n=1}^N x_n \\
a_N &= a_0 + \frac{N+1}{2} \\
b_N &= b_0 + \frac{1}{2}\left[(\lambda_0+N)\left(\lambda_N^{-1} + \mu_N^2\right) - 2\left(\lambda_0\mu_0 + \sum_{n=1}^N x_n\right)\mu_N + \left(\sum_{n=1}^N x_n^2\right) + \lambda_0\mu_0^2\right]
\end{aligned}

Note that there are circular dependencies among the formulas for λ_N and b_N. This naturally suggests an EM-like algorithm (a minimal code sketch follows the steps below):



  1. Compute ∑_{n=1}^N x_n and ∑_{n=1}^N x_n². Use these values to compute μ_N and a_N.

  2. Initialize λ_N to some arbitrary value.

  3. Use the current value of λ_N, along with the known values of the other parameters, to compute b_N.

  4. Use the current value of b_N, along with the known values of the other parameters, to compute λ_N.

  5. Repeat the last two steps until convergence (i.e. until neither value has changed more than some small amount).
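
A minimal implementation of this loop might look as follows. This is a sketch only: the function name, the hyperparameter defaults, the convergence tolerance and the synthetic data at the end are illustrative choices, not prescribed by the method.

    import numpy as np

    def variational_bayes_normal(x, mu_0=0.0, lambda_0=1.0, a_0=0.1, b_0=0.1, tol=1e-10):
        """Mean-field VB for a Gaussian with unknown mean mu and precision tau.

        Returns the parameters (mu_N, lambda_N) of q*(mu) and (a_N, b_N) of q*(tau).
        """
        x = np.asarray(x, dtype=float)
        N = len(x)
        sum_x, sum_x2 = x.sum(), (x ** 2).sum()

        # Step 1: quantities that never change across iterations.
        mu_N = (lambda_0 * mu_0 + sum_x) / (lambda_0 + N)
        a_N = a_0 + (N + 1) / 2.0

        # Step 2: initialize lambda_N to an arbitrary value.
        lambda_N = 1.0
        b_N = np.inf

        while True:
            # Step 3: update b_N from lambda_N, using E[mu] = mu_N and E[mu^2] = 1/lambda_N + mu_N^2.
            b_new = b_0 + 0.5 * ((lambda_0 + N) * (1.0 / lambda_N + mu_N ** 2)
                                 - 2.0 * (lambda_0 * mu_0 + sum_x) * mu_N
                                 + sum_x2 + lambda_0 * mu_0 ** 2)
            # Step 4: update lambda_N from the new b_N, using E[tau] = a_N / b_N.
            lambda_new = (lambda_0 + N) * a_N / b_new

            # Step 5: stop when neither value changes by more than tol.
            if abs(b_new - b_N) < tol and abs(lambda_new - lambda_N) < tol:
                return mu_N, lambda_new, a_N, b_new
            lambda_N, b_N = lambda_new, b_new

    # Example usage on synthetic data (true mean 2.0, true precision 4.0).
    rng = np.random.default_rng(0)
    data = rng.normal(2.0, 0.5, size=200)
    mu_N, lambda_N, a_N, b_N = variational_bayes_normal(data)
    print(mu_N, a_N / b_N)   # approximate posterior mean of mu and of tau under q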


We then have values for the hyperparameters of the approximating distributions of the posterior parameters, which we can use to compute any properties we want of the posterior — e.g. its mean and variance, a 95% highest-density region (the smallest interval that includes 95% of the total probability), etc.


It can be shown that this algorithm is guaranteed to converge to a local maximum.


Note also that the posterior distributions have the same form as the corresponding prior distributions. We did not assume this; the only assumption we made was that the distributions factorize, and the form of the distributions followed naturally. It turns out (see below) that the fact that the posterior distributions have the same form as the prior distributions is not a coincidence, but a general result whenever the prior distributions are members of the exponential family, which is the case for most of the standard distributions.



Further discussion



Step-by-step recipe


The above example shows the method by which the variational-Bayesian approximation to a posterior probability density in a given Bayesian network is derived:



  1. Describe the network with a graphical model, identifying the observed variables (data) X and unobserved variables (parameters Θ and latent variables Z) and their conditional probability distributions. Variational Bayes will then construct an approximation to the posterior probability p(Z, Θ ∣ X). The approximation has the basic property that it is a factorized distribution, i.e. a product of two or more independent distributions over disjoint subsets of the unobserved variables.

  2. Partition the unobserved variables into two or more subsets, over which the independent factors will be derived. There is no universal procedure for doing this; creating too many subsets yields a poor approximation, while creating too few makes the entire variational Bayes procedure intractable. Typically, the first split is to separate the parameters and latent variables; often, this is enough by itself to produce a tractable result. Assume that the partitions are called Z_1, …, Z_M.

  3. For a given partition Z_j, write down the formula for the best approximating distribution q_j*(Z_j ∣ X) using the basic equation ln q_j*(Z_j ∣ X) = E_{i≠j}[ln p(Z, X)] + constant.

  4. Fill in the formula for the joint probability distribution using the graphical model. Any component conditional distributions that don't involve any of the variables in Z_j can be ignored; they will be folded into the constant term.

  5. Simplify the formula and apply the expectation operator, following the above example. Ideally, this should simplify into expectations of basic functions of variables not in Z_j (e.g. first or second raw moments, expectation of a logarithm, etc.). In order for the variational Bayes procedure to work well, these expectations should generally be expressible analytically as functions of the parameters and/or hyperparameters of the distributions of these variables. In all cases, these expectation terms are constants with respect to the variables in the current partition.

  6. The functional form of the formula with respect to the variables in the current partition indicates the type of distribution. In particular, exponentiating the formula generates the probability density function (PDF) of the distribution (or at least, something proportional to it, with unknown normalization constant). In order for the overall method to be tractable, it should be possible to recognize the functional form as belonging to a known distribution. Significant mathematical manipulation may be required to convert the formula into a form that matches the PDF of a known distribution. When this can be done, the normalization constant can be reinstated by definition, and equations for the parameters of the known distribution can be derived by extracting the appropriate parts of the formula.

  7. When all expectations can be replaced analytically with functions of variables not in the current partition, and the PDF put into a form that allows identification with a known distribution, the result is a set of equations expressing the values of the optimum parameters as functions of the parameters of variables in other partitions.

  8. When this procedure can be applied to all partitions, the result is a set of mutually linked equations specifying the optimum values of all parameters.

  9. An expectation maximization (EM) type procedure is then applied, picking an initial value for each parameter and then iterating through a series of steps, where at each step we cycle through the equations, updating each parameter in turn. This is guaranteed to converge; a generic sketch of such an update loop follows this list.
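
The shape of that final update loop, independent of any particular model, is sketched below. The two toy update equations are placeholders standing in for the model-specific formulas that the earlier steps would produce:

    from typing import Callable, Dict

    def coordinate_updates(params: Dict[str, float],
                           updates: Dict[str, Callable[[Dict[str, float]], float]],
                           tol: float = 1e-12) -> Dict[str, float]:
        """Cycle through the update equations until no parameter moves by more than tol."""
        while True:
            max_change = 0.0
            for name, update in updates.items():
                new_value = update(params)
                max_change = max(max_change, abs(new_value - params[name]))
                params[name] = new_value
            if max_change < tol:
                return params

    # Toy stand-ins for the model-specific formulas: two mutually dependent equations.
    print(coordinate_updates(
        params={"lam": 1.0, "b": 1.0},
        updates={
            "b": lambda p: 1.0 + 2.0 / p["lam"],
            "lam": lambda p: 10.0 / p["b"],
        },
    ))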



Most important points


Due to all of the mathematical manipulations involved, it is easy to lose track of the big picture. The important things are:



  1. The idea of variational Bayes is to construct an analytical approximation to the posterior probability of the set of unobserved variables (parameters and latent variables), given the data. This means that the form of the solution is similar to other Bayesian inference methods, such as Gibbs sampling — i.e. a distribution that seeks to describe everything that is known about the variables. As in other Bayesian methods — but unlike e.g. in expectation maximization (EM) or other maximum likelihood methods — both types of unobserved variables (i.e. parameters and latent variables) are treated the same, i.e. as random variables. Estimates for the variables can then be derived in the standard Bayesian ways, e.g. calculating the mean of the distribution to get a single point estimate or deriving a credible interval, highest density region, etc.

  2. "Analytical approximation" means that a formula can be written down for the posterior distribution. The formula generally consists of a product of well-known probability distributions, each of which factorizes over a set of unobserved variables (i.e. it is conditionally independent of the other variables, given the observed data). This formula is not the true posterior distribution, but an approximation to it; in particular, it will generally agree fairly closely in the lowest moments of the unobserved variables, e.g. the mean and variance.

  3. The result of all of the mathematical manipulations is (1) the identity of the probability distributions making up the factors, and (2) mutually dependent formulas for the parameters of these distributions. The actual values of these parameters are computed numerically, through an alternating iterative procedure much like EM.



Compared with expectation maximization (EM)


Variational Bayes (VB) is often compared with expectation maximization (EM). The actual numerical procedure is quite similar, in that both are alternating iterative procedures that successively converge on optimum parameter values. The initial steps to derive the respective procedures are also vaguely similar, both starting out with formulas for probability densities and both involving significant amounts of mathematical manipulations.


However, there are a number of differences. Most important is what is being computed.



  • EM computes point estimates of the posterior distribution of those random variables that can be categorized as "parameters", but only estimates of the actual posterior distributions of the latent variables (at least in "soft EM", and often only when the latent variables are discrete). The point estimates computed are the modes of these parameters; no other information is available.

  • VB, on the other hand, computes estimates of the actual posterior distribution of all variables, both parameters and latent variables. When point estimates need to be derived, generally the mean is used rather than the mode, as is normal in Bayesian inference. Concomitant with this, it should be noted that the parameters computed in VB do not have the same significance as those in EM. EM computes optimum values of the parameters of the Bayes network itself. VB computes optimum values of the parameters of the distributions used to approximate the parameters and latent variables of the Bayes network. For example, a typical Gaussian mixture model will have parameters for the mean and variance of each of the mixture components. EM would directly estimate optimum values for these parameters. VB, however, would first fit a distribution to these parameters — typically in the form of a prior distribution, e.g. a normal-scaled inverse gamma distribution — and would then compute values for the parameters of this prior distribution, i.e. essentially hyperparameters. In this case, VB would compute optimum estimates of the four parameters of the normal-scaled inverse gamma distribution that describes the joint distribution of the mean and variance of the component.




A more complex example




Bayesian Gaussian mixture model using plate notation. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of size K; [D,D] means a matrix of size D×D; K alone means a categorical variable with K outcomes. The squiggly line coming from z ending in a crossbar indicates a switch — the value of this variable selects, for the other incoming variables, which value to use out of the size-K array of possible values.


Imagine a Bayesian Gaussian mixture model described as follows:[5]


\begin{aligned}
\boldsymbol{\pi} &\sim \operatorname{SymDir}(K, \alpha_0) \\
\boldsymbol{\Lambda}_{i=1\dots K} &\sim \mathcal{W}(\mathbf{W}_0, \nu_0) \\
\boldsymbol{\mu}_{i=1\dots K} &\sim \mathcal{N}(\boldsymbol{\mu}_0, (\beta_0\boldsymbol{\Lambda}_i)^{-1}) \\
\mathbf{z}[i=1\dots N] &\sim \operatorname{Mult}(1, \boldsymbol{\pi}) \\
\mathbf{x}_{i=1\dots N} &\sim \mathcal{N}(\boldsymbol{\mu}_{z_i}, \boldsymbol{\Lambda}_{z_i}^{-1}) \\
K &= \text{number of mixing components} \\
N &= \text{number of data points}
\end{aligned}

Note:



  • SymDir() is the symmetric Dirichlet distribution of dimension K, with the hyperparameter for each component set to α_0. The Dirichlet distribution is the conjugate prior of the categorical distribution or multinomial distribution.


  • 𝒲() is the Wishart distribution, which is the conjugate prior of the precision matrix (inverse covariance matrix) for a multivariate Gaussian distribution.

  • Mult() is a multinomial distribution over a single observation (equivalent to a categorical distribution). The state space is a "one-of-K" representation, i.e. a K-dimensional vector in which one of the elements is 1 (specifying the identity of the observation) and all other elements are 0.


  • 𝒩() is the Gaussian distribution, in this case specifically the multivariate Gaussian distribution.


The interpretation of the above variables is as follows:




  • X = {x_1, …, x_N} is the set of N data points, each of which is a D-dimensional vector distributed according to a multivariate Gaussian distribution.


  • Z = {z_1, …, z_N} is a set of latent variables, one per data point, specifying which mixture component the corresponding data point belongs to, using a "one-of-K" vector representation with components z_{nk} for k = 1 … K, as described above.


  • π is the vector of mixing proportions for the K mixture components.


  • μ_{i=1…K} and Λ_{i=1…K} specify the parameters (mean and precision) associated with each mixture component.
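
To make the generative process concrete, the following sketch draws a small synthetic data set from this model; the dimensionality, the number of components and the hyperparameter values are arbitrary choices for the example:

    import numpy as np
    from scipy.stats import wishart

    rng = np.random.default_rng(0)

    K, N, D = 3, 500, 2                          # mixing components, data points, dimensionality
    alpha_0, beta_0, nu_0 = 1.0, 1.0, D + 2.0    # arbitrary hyperparameters for the example
    mu_0, W_0 = np.zeros(D), np.eye(D)

    pi = rng.dirichlet([alpha_0] * K)                                                 # mixing proportions
    Lambda = [wishart.rvs(df=nu_0, scale=W_0, random_state=rng) for _ in range(K)]    # precision matrices
    mu = [rng.multivariate_normal(mu_0, np.linalg.inv(beta_0 * L)) for L in Lambda]   # component means
    z = rng.choice(K, size=N, p=pi)                                                   # latent assignments
    X = np.array([rng.multivariate_normal(mu[k], np.linalg.inv(Lambda[k])) for k in z])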


The joint probability of all variables can be rewritten as


p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = p(\mathbf{X} \mid \mathbf{Z}, \boldsymbol{\mu}, \boldsymbol{\Lambda})\, p(\mathbf{Z} \mid \boldsymbol{\pi})\, p(\boldsymbol{\pi})\, p(\boldsymbol{\mu} \mid \boldsymbol{\Lambda})\, p(\boldsymbol{\Lambda})

where the individual factors are


\begin{aligned}
p(\mathbf{X} \mid \mathbf{Z}, \boldsymbol{\mu}, \boldsymbol{\Lambda}) &= \prod_{n=1}^N \prod_{k=1}^K \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k^{-1})^{z_{nk}} \\
p(\mathbf{Z} \mid \boldsymbol{\pi}) &= \prod_{n=1}^N \prod_{k=1}^K \pi_k^{z_{nk}} \\
p(\boldsymbol{\pi}) &= \frac{\Gamma(K\alpha_0)}{\Gamma(\alpha_0)^K} \prod_{k=1}^K \pi_k^{\alpha_0 - 1} \\
p(\boldsymbol{\mu} \mid \boldsymbol{\Lambda}) &= \prod_{k=1}^K \mathcal{N}(\boldsymbol{\mu}_k \mid \boldsymbol{\mu}_0, (\beta_0\boldsymbol{\Lambda}_k)^{-1}) \\
p(\boldsymbol{\Lambda}) &= \prod_{k=1}^K \mathcal{W}(\boldsymbol{\Lambda}_k \mid \mathbf{W}_0, \nu_0)
\end{aligned}

where


\begin{aligned}
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) &= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} \\
\mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu) &= B(\mathbf{W}, \nu)\, |\boldsymbol{\Lambda}|^{(\nu-D-1)/2} \exp\left(-\frac{1}{2}\operatorname{Tr}(\mathbf{W}^{-1}\boldsymbol{\Lambda})\right) \\
B(\mathbf{W}, \nu) &= |\mathbf{W}|^{-\nu/2} \left\{2^{\nu D/2}\pi^{D(D-1)/4}\prod_{i=1}^D \Gamma\left(\frac{\nu+1-i}{2}\right)\right\}^{-1} \\
D &= \text{dimensionality of each data point}
\end{aligned}

Assume that $q(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}) = q(\mathbf{Z})\,q(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})$.


Then


\[
\begin{aligned}
\ln q^*(\mathbf{Z}) &= \operatorname{E}_{\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}}[\ln p(\mathbf{X},\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})] + \text{constant} \\
&= \operatorname{E}_{\boldsymbol{\pi}}[\ln p(\mathbf{Z}\mid\boldsymbol{\pi})] + \operatorname{E}_{\boldsymbol{\mu},\boldsymbol{\Lambda}}[\ln p(\mathbf{X}\mid\mathbf{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})] + \text{constant} \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\ln\rho_{nk} + \text{constant}
\end{aligned}
\]

where we have defined


\[
\ln\rho_{nk} = \operatorname{E}[\ln\pi_k] + \tfrac{1}{2}\operatorname{E}[\ln|\boldsymbol{\Lambda}_k|] - \tfrac{D}{2}\ln(2\pi) - \tfrac{1}{2}\operatorname{E}_{\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k}\!\left[(\mathbf{x}_n-\boldsymbol{\mu}_k)^{\mathrm{T}}\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\right]
\]

Exponentiating both sides of the formula for $\ln q^*(\mathbf{Z})$ yields


\[
q^*(\mathbf{Z}) \propto \prod_{n=1}^{N}\prod_{k=1}^{K} \rho_{nk}^{z_{nk}}
\]

Requiring that this be normalized amounts to normalizing the $\rho_{nk}$ so that they sum to 1 over all values of $k$, yielding


\[
q^*(\mathbf{Z}) = \prod_{n=1}^{N}\prod_{k=1}^{K} r_{nk}^{z_{nk}}
\]

where


\[
r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K}\rho_{nj}}
\]

In other words, $q^*(\mathbf{Z})$ factors over the individual $\mathbf{z}_n$, each of which is distributed as a single-observation multinomial (categorical) distribution with parameters $r_{nk}$ for $k = 1 \dots K$.


Furthermore, we note that


\[
\operatorname{E}[z_{nk}] = r_{nk}
\]

which is a standard result for categorical distributions.


Now, considering the factor $q(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})$, note that it automatically factors into $q(\boldsymbol{\pi})\prod_{k=1}^{K} q(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k)$ due to the structure of the graphical model defining our Gaussian mixture model, which is specified above.


Then,


\[
\begin{aligned}
\ln q^*(\boldsymbol{\pi}) &= \ln p(\boldsymbol{\pi}) + \operatorname{E}_{\mathbf{Z}}[\ln p(\mathbf{Z}\mid\boldsymbol{\pi})] + \text{constant} \\
&= (\alpha_0 - 1)\sum_{k=1}^{K}\ln\pi_k + \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\ln\pi_k + \text{constant}
\end{aligned}
\]

Taking the exponential of both sides, we recognize $q^*(\boldsymbol{\pi})$ as a Dirichlet distribution


\[
q^*(\boldsymbol{\pi}) \sim \operatorname{Dir}(\boldsymbol{\alpha})
\]

where


\[
\alpha_k = \alpha_0 + N_k
\]

where


\[
N_k = \sum_{n=1}^{N} r_{nk}
\]

Finally,


\[
\ln q^*(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k) = \ln p(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k) + \sum_{n=1}^{N} \operatorname{E}[z_{nk}]\,\ln\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1}) + \text{constant}
\]

Grouping and reading off the terms involving $\boldsymbol{\mu}_k$ and $\boldsymbol{\Lambda}_k$, the result is a Gaussian-Wishart distribution given by


\[
q^*(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k) = \mathcal{N}(\boldsymbol{\mu}_k\mid\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\,\mathcal{W}(\boldsymbol{\Lambda}_k\mid\mathbf{W}_k,\nu_k)
\]

given the definitions


\[
\begin{aligned}
\beta_k &= \beta_0 + N_k \\
\mathbf{m}_k &= \frac{1}{\beta_k}\left(\beta_0\boldsymbol{\mu}_0 + N_k\bar{\mathbf{x}}_k\right) \\
\mathbf{W}_k^{-1} &= \mathbf{W}_0^{-1} + N_k\mathbf{S}_k + \frac{\beta_0 N_k}{\beta_0 + N_k}(\bar{\mathbf{x}}_k - \boldsymbol{\mu}_0)(\bar{\mathbf{x}}_k - \boldsymbol{\mu}_0)^{\mathrm{T}} \\
\nu_k &= \nu_0 + N_k \\
N_k &= \sum_{n=1}^{N} r_{nk} \\
\bar{\mathbf{x}}_k &= \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\mathbf{x}_n \\
\mathbf{S}_k &= \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}(\mathbf{x}_n - \bar{\mathbf{x}}_k)(\mathbf{x}_n - \bar{\mathbf{x}}_k)^{\mathrm{T}}
\end{aligned}
\]
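As an illustration, here is a minimal sketch of these M-step-like updates in Python with NumPy (an assumed dependency). The function name `update_gaussian_wishart` is hypothetical; `r` is the $N \times K$ matrix of responsibilities $r_{nk}$, and the Dirichlet update $\alpha_k = \alpha_0 + N_k$ from above is included alongside the Gaussian-Wishart updates.

```python
import numpy as np

def update_gaussian_wishart(X, r, alpha0, beta0, mu0, W0, nu0):
    """Recompute the variational parameters from responsibilities r (shape N x K)."""
    N, D = X.shape
    K = r.shape[1]

    Nk = r.sum(axis=0) + 1e-10                       # soft counts N_k, guarded against 0
    xbar = (r.T @ X) / Nk[:, None]                   # weighted means x̄_k, shape (K, D)

    Sk = np.zeros((K, D, D))                         # weighted scatter matrices S_k
    for k in range(K):
        diff = X - xbar[k]
        Sk[k] = (r[:, k, None] * diff).T @ diff / Nk[k]

    alpha = alpha0 + Nk                              # Dirichlet parameters alpha_k
    beta = beta0 + Nk                                # beta_k
    m = (beta0 * mu0 + Nk[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nk                                    # nu_k

    W = np.zeros((K, D, D))
    for k in range(K):
        d = (xbar[k] - mu0)[:, None]
        W_inv = (np.linalg.inv(W0) + Nk[k] * Sk[k]
                 + (beta0 * Nk[k] / (beta0 + Nk[k])) * (d @ d.T))
        W[k] = np.linalg.inv(W_inv)

    return alpha, beta, m, W, nu, Nk, xbar, Sk
```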

Finally, notice that these updates require the values of $r_{nk}$, which make use of $\rho_{nk}$, which is defined in turn based on $\operatorname{E}[\ln\pi_k]$, $\operatorname{E}[\ln|\boldsymbol{\Lambda}_k|]$, and $\operatorname{E}_{\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k}[(\mathbf{x}_n-\boldsymbol{\mu}_k)^{\mathrm{T}}\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)]$. Now that we have determined the distributions over which these expectations are taken, we can derive formulas for them:


\[
\begin{aligned}
\operatorname{E}_{\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k}\!\left[(\mathbf{x}_n-\boldsymbol{\mu}_k)^{\mathrm{T}}\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\right] &= D\beta_k^{-1} + \nu_k(\mathbf{x}_n-\mathbf{m}_k)^{\mathrm{T}}\mathbf{W}_k(\mathbf{x}_n-\mathbf{m}_k) \\
\ln\tilde{\Lambda}_k &\equiv \operatorname{E}[\ln|\boldsymbol{\Lambda}_k|] = \sum_{i=1}^{D}\psi\!\left(\frac{\nu_k+1-i}{2}\right) + D\ln 2 + \ln|\mathbf{W}_k| \\
\ln\tilde{\pi}_k &\equiv \operatorname{E}[\ln\pi_k] = \psi(\alpha_k) - \psi\!\left(\sum_{i=1}^{K}\alpha_i\right)
\end{aligned}
\]
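Here $\psi$ is the digamma function. A minimal sketch of computing these three expectations with NumPy and SciPy (assumed dependencies; the function names are hypothetical and vectorize over components $k$, and over data points $n$ where relevant):

```python
import numpy as np
from scipy.special import digamma

def expected_log_pi(alpha):
    """E[ln pi_k] = psi(alpha_k) - psi(sum_i alpha_i); alpha has shape (K,)."""
    return digamma(alpha) - digamma(alpha.sum())

def expected_log_det_Lambda(W, nu):
    """E[ln |Lambda_k|] for each component; W has shape (K, D, D), nu shape (K,)."""
    D = W.shape[-1]
    i = np.arange(1, D + 1)
    return (digamma((nu[:, None] + 1 - i) / 2).sum(axis=1)
            + D * np.log(2)
            + np.log(np.linalg.det(W)))

def expected_mahalanobis(X, m, W, nu, beta):
    """E[(x_n - mu_k)^T Lambda_k (x_n - mu_k)] for all n, k; returns shape (N, K)."""
    diff = X[:, None, :] - m[None, :, :]                    # (N, K, D)
    quad = np.einsum('nkd,kde,nke->nk', diff, W, diff)      # (x_n - m_k)^T W_k (x_n - m_k)
    return W.shape[-1] / beta[None, :] + nu[None, :] * quad
```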

Substituting these expectations into the formula for $\rho_{nk}$ leads to


\[
r_{nk} \propto \tilde{\pi}_k\,\tilde{\Lambda}_k^{1/2} \exp\left\{ -\frac{D}{2\beta_k} - \frac{\nu_k}{2}(\mathbf{x}_n-\mathbf{m}_k)^{\mathrm{T}}\mathbf{W}_k(\mathbf{x}_n-\mathbf{m}_k) \right\}
\]

These can be converted from proportional to absolute values by normalizing over $k$ so that the corresponding values sum to 1.
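Putting the pieces together, a minimal E-step-like sketch (Python/NumPy, building on the hypothetical helper functions sketched above) that computes the responsibilities in log space for numerical stability and then normalizes over $k$:

```python
import numpy as np

def update_responsibilities(X, alpha, beta, m, W, nu):
    """Compute r_nk from the current variational parameters; returns shape (N, K)."""
    log_pi_tilde = expected_log_pi(alpha)                       # E[ln pi_k], shape (K,)
    log_Lambda_tilde = expected_log_det_Lambda(W, nu)           # E[ln |Lambda_k|], shape (K,)
    E_quad = expected_mahalanobis(X, m, W, nu, beta)            # shape (N, K)

    # ln rho_nk, up to an additive constant independent of k
    log_rho = log_pi_tilde + 0.5 * log_Lambda_tilde - 0.5 * E_quad

    # Normalize over k using the log-sum-exp trick
    log_rho -= log_rho.max(axis=1, keepdims=True)
    r = np.exp(log_rho)
    return r / r.sum(axis=1, keepdims=True)
```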


Note that:



  1. The update equations for the parameters $\beta_k$, $\mathbf{m}_k$, $\mathbf{W}_k$ and $\nu_k$ of the variables $\boldsymbol{\mu}_k$ and $\boldsymbol{\Lambda}_k$ depend on the statistics $N_k$, $\bar{\mathbf{x}}_k$, and $\mathbf{S}_k$, and these statistics in turn depend on $r_{nk}$.

  2. The update equations for the parameters $\alpha_{1\dots K}$ of the variable $\boldsymbol{\pi}$ depend on the statistic $N_k$, which depends in turn on $r_{nk}$.

  3. The update equation for $r_{nk}$ has a direct circular dependence on $\beta_k$, $\mathbf{m}_k$, $\mathbf{W}_k$ and $\nu_k$, as well as an indirect circular dependence on $\mathbf{W}_k$, $\nu_k$ and $\alpha_{1\dots K}$ through $\tilde{\pi}_k$ and $\tilde{\Lambda}_k$.


This suggests an iterative procedure that alternates between two steps:



  1. An E-step that computes the value of $r_{nk}$ using the current values of all the other parameters.

  2. An M-step that uses the new value of $r_{nk}$ to compute new values of all the other parameters.


Note that these steps correspond closely with the standard EM algorithm used to derive a maximum likelihood or maximum a posteriori (MAP) solution for the parameters of a Gaussian mixture model. The responsibilities $r_{nk}$ in the E step correspond closely to the posterior probabilities of the latent variables given the data, i.e. $p(\mathbf{Z}\mid\mathbf{X})$; the computation of the statistics $N_k$, $\bar{\mathbf{x}}_k$, and $\mathbf{S}_k$ corresponds closely to the computation of the corresponding "soft-count" statistics over the data; and the use of those statistics to compute new values of the parameters corresponds closely to the use of soft counts to compute new parameter values in normal EM over a Gaussian mixture model.
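A minimal sketch of the resulting coordinate-ascent loop, assuming the hypothetical `update_gaussian_wishart` and `update_responsibilities` helpers sketched above, and stopping after a fixed iteration budget rather than monitoring convergence of the evidence lower bound:

```python
import numpy as np

def variational_gmm(X, K, alpha0=1.0, beta0=1.0, nu0=None, n_iter=100, seed=0):
    """Variational Bayes for a Gaussian mixture model (illustrative sketch)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu0 = X.mean(axis=0)                     # data-centered prior mean (a common choice)
    W0 = np.eye(D)
    nu0 = float(D) if nu0 is None else nu0

    # Start from random responsibilities
    r = rng.dirichlet(np.ones(K), size=N)

    for _ in range(n_iter):
        # M-step-like update of the variational parameters
        alpha, beta, m, W, nu, Nk, xbar, Sk = update_gaussian_wishart(
            X, r, alpha0, beta0, mu0, W0, nu0)
        # E-step-like update of the responsibilities
        r = update_responsibilities(X, alpha, beta, m, W, nu)

    return r, alpha, beta, m, W, nu
```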



Exponential-family distributions


Note that in the previous example, once the distribution over unobserved variables was assumed to factorize into distributions over the "parameters" and distributions over the "latent data", the derived "best" distribution for each variable was in the same family as the corresponding prior distribution over the variable. This is a general result that holds true for all prior distributions derived from the exponential family.



See also




  • Variational message passing: a modular algorithm for variational Bayesian inference.


  • Expectation-maximization algorithm: a related approach which corresponds to a special case of variational Bayesian inference.


  • Generalized filtering: a variational filtering scheme for nonlinear state space models.


  • Calculus of variations: the field of mathematical analysis that deals with maximizing or minimizing functionals.


  • Maximum Entropy Discrimination: a variational inference framework that allows additional large-margin constraints to be introduced and accounted for.[6]



Notes





  1. ^ abcd Tran, Viet Hung (2018). "Copula Variational Bayes inference via information geometry". arXiv:1803.10998 [cs.IT].


  2. ^ ab Adamčík, Martin (2014). "The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning". Entropy. 16 (12): 6338–6381. Bibcode:2014Entrp..16.6338A. doi:10.3390/e16126338.


  3. ^ Boyd, Stephen P.; Vandenberghe, Lieven (2004). Convex Optimization (pdf). Cambridge University Press. ISBN 978-0-521-83378-3. Retrieved October 15, 2011.


  4. ^ Based on Chapter 10 of Pattern Recognition and Machine Learning by Christopher M. Bishop


  5. ^ Based on Chapter 10 of Pattern Recognition and Machine Learning by Christopher M. Bishop


  6. ^ Sotirios P. Chatzis, “Infinite Markov-Switching Maximum Entropy Discrimination Machines,” Proc. 30th International Conference on Machine Learning (ICML). Journal of Machine Learning Research: Workshop and Conference Proceedings, vol. 28, no. 3, pp. 729-737, June 2013.




References



  • Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 978-0-387-31073-2.


External links




  • The on-line textbook: Information Theory, Inference, and Learning Algorithms, by David J.C. MacKay provides an introduction to variational methods (p. 422).


  • A Tutorial on Variational Bayes. Fox, C. and Roberts, S. 2012. Artificial Intelligence Review, doi:10.1007/s10462-011-9236-8.


  • Variational-Bayes Repository: a repository of research papers, software, and links related to the use of variational methods for approximate Bayesian learning, up to 2003.


  • Variational Algorithms for Approximate Bayesian Inference, by M. J. Beal, includes comparisons of EM to Variational Bayesian EM and derivations of several models including Variational Bayesian HMMs.


  • High-Level Explanation of Variational Inference by Jason Eisner may be worth reading before a more mathematically detailed treatment.


  • Copula Variational Bayes inference via information geometry (pdf) by Tran, V.H. 2018. This paper is primarily written for students. Via Bregman divergence, the paper shows that Variational Bayes is simply a generalized Pythagorean projection of the true model onto an arbitrarily correlated (copula) distributional space, of which the independent space is merely a special case.



