Kullback-Leibler Divergence (KLD)
Kullback-Leibler Divergence is a type of statistical distance: a measure of how one probability distribution \(P\) differs from a second, reference probability distribution \(Q\).
Notation
\[\begin{align*} D_{KL}(P \,||\, Q) \\ KL(P \,||\, Q) \end{align*}\]
Definition
For discrete probability distributions \(P\) and \(Q\) defined on the same sample space \(\mathcal{X}\), the relative entropy from \(Q\) to \(P\) is defined as
\[KL(P \,||\, Q) = \sum_{x\in\mathcal{X}}P(x)\log{\frac{P(x)}{Q(x)}}\]
For distributions \(P\) and \(Q\) of a continuous random variable, with densities \(p\) and \(q\), the relative entropy is defined as
\[KL(P \,||\, Q) = \int_{-\infty}^{+\infty}p(x)\log{\frac{p(x)}{q(x)}}\,dx\]
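As a minimal numerical sketch of the discrete definition (the distributions below are made-up illustrative values, not taken from the text), the sum can be computed directly or with SciPy's elementwise relative entropy:

```python
import numpy as np
from scipy.special import rel_entr

# Two discrete distributions on the same sample space {0, 1, 2}
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

# Direct implementation of sum_x P(x) * log(P(x) / Q(x))
kl_manual = np.sum(P * np.log(P / Q))

# Same quantity via SciPy's elementwise relative entropy
kl_scipy = np.sum(rel_entr(P, Q))

print(kl_manual, kl_scipy)  # both ≈ 0.025 nats

# KL divergence is not symmetric: KL(P||Q) != KL(Q||P) in general
print(np.sum(rel_entr(Q, P)))
```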
Applying Kullback-Leibler Divergence to Bayesian Backpropagation
\[\begin{align*} \theta^{*} &= \text{argmin}_{\theta}KL[q(w|\theta) \; || \; P(w|\mathcal{D})] \\ &= \text{argmin}_{\theta}\int{q(w|\theta)\log{\frac{q(w|\theta)}{P(w|\mathcal{D})}}}dw \\ &= \text{argmin}_{\theta}\int{q(w|\theta)\log{\frac{q(w|\theta)P(\mathcal{D})}{P(w)P(\mathcal{D}|w)}}}dw \\ &= \text{argmin}_{\theta}\int{q(w|\theta)\log{\frac{q(w|\theta)}{P(w)P(\mathcal{D}|w)}}}dw \\ &= \text{argmin}_{\theta}\left(\int{q(w|\theta)\log{\frac{q(w|\theta)}{P(w)}}}dw \; - \; \int{q(w|\theta)\log{P(\mathcal{D}|w)}dw} \right) \\ &= \text{argmin}_{\theta}\left(KL[q(w|\theta) \; || \; P(w)] \; - \; \mathbb{E}_{q(w|\theta)}[\log{P(\mathcal{D}|w)}]\right) \end{align*}\]
The evidence \(P(\mathcal{D})\) does not depend on \(w\) or \(\theta\), so it contributes only an additive constant and can be dropped in the fourth line.
Complexity cost (prior-dependent part): \(KL[q(w \mid \theta) \mid\mid P(w)]\)
Likelihood cost (data-dependent part): \(\mathbb{E}_{q(w \mid \theta)}[\log P(\mathcal{D}\mid w)]\)
Resulting cost function:
\[\mathcal{F}(\mathcal{D}, \theta) = KL[q(w|\theta) \,||\, P(w)] - \mathbb{E}_{q(w|\theta)}[\log P(\mathcal{D}|w)]\]
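A minimal PyTorch sketch of this cost function, assuming a single Bayesian linear layer with a factorized Gaussian variational posterior \(q(w|\theta)\), a standard-normal prior \(P(w)\), and a Gaussian likelihood for the data. The class and function names (`BayesianLinear`, `elbo_loss`) and the specific prior/likelihood choices are illustrative assumptions, not from the text; the KL term plays the role of the complexity cost and the single-sample Monte Carlo log-likelihood the data-dependent cost.

```python
import torch
import torch.nn as nn
import torch.distributions as dist


class BayesianLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior q(w|theta). Hypothetical sketch."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Variational parameters theta = (mu, rho), with sigma = softplus(rho) > 0
        self.mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        # Fixed standard-normal prior P(w)
        self.prior = dist.Normal(0.0, 1.0)

    def forward(self, x):
        sigma = torch.nn.functional.softplus(self.rho)
        # Reparameterized sample w ~ q(w|theta)
        eps = torch.randn_like(sigma)
        w = self.mu + sigma * eps
        q = dist.Normal(self.mu, sigma)
        # Complexity cost: KL[q(w|theta) || P(w)], summed over all weights
        kl = dist.kl_divergence(q, self.prior).sum()
        return x @ w.t(), kl


def elbo_loss(model, x, y, noise_std=1.0):
    """F(D, theta) = KL[q||P] - E_q[log P(D|w)], estimated with one MC sample."""
    pred, kl = model(x)
    # Likelihood cost: expected log-likelihood under an assumed Gaussian noise model
    log_lik = dist.Normal(pred, noise_std).log_prob(y).sum()
    return kl - log_lik
```

Minimizing `elbo_loss` with a standard optimizer updates \(\theta = (\mu, \rho)\) by ordinary backpropagation through the reparameterized sample, which is the sense in which the KL objective above turns Bayesian inference into a gradient-descent problem. In minibatch training the KL term is typically reweighted (e.g. divided by the number of minibatches) so the full-dataset cost is preserved.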