KL Divergence and its variants

June 5, 2025

by Leonardo

1. KL Divergence and its variants

$KL (𝑃 ‖ 𝑄) = 𝐸_{𝑥 \sim 𝑃} [\log \frac{𝑃 (𝑥)}{𝑄 (𝑥)}]$

Forward (inclusive) KL: $KL (𝑃 ‖ 𝑄)$ where $𝑃$ is the true distribution and $𝑄$ is the approximating distribution.

Mode covering: Since we sample from $𝑃$ and penalize when $𝑄 (𝑥)$ is small, $𝑄$ tends to cover all modes of $𝑃$ (even if it means being over-dispersed). If $𝑃 (𝑥) > 0$ but $𝑄 (𝑥) \approx 0$ , the penalty is large.
Typical use: Maximum likelihood estimation, VAE (ensures decoder explains all data)
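The link between forward KL and maximum likelihood can be made explicit with a short derivation. The parametric model $𝑄_{\theta}$ and the samples $𝑥_{1}, \dots, 𝑥_{𝑁}$ drawn from $𝑃$ are notation introduced here for illustration; they do not appear elsewhere in these notes.

$KL (𝑃 ‖ 𝑄_{\theta}) = 𝐸_{𝑥 \sim 𝑃} [\log 𝑃 (𝑥)] - 𝐸_{𝑥 \sim 𝑃} [\log 𝑄_{\theta} (𝑥)]$

The first term does not depend on $\theta$ , so

$\arg\min_{\theta} KL (𝑃 ‖ 𝑄_{\theta}) = \arg\max_{\theta} 𝐸_{𝑥 \sim 𝑃} [\log 𝑄_{\theta} (𝑥)] \approx \arg\max_{\theta} \frac{1}{𝑁} \sum_{𝑖 = 1}^{𝑁} \log 𝑄_{\theta} (𝑥_{𝑖})$

which is the average log-likelihood of the samples under the model.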

Reverse (exclusive) KL: $KL (𝑄 ‖ 𝑃)$

Mode seeking: Since we sample from $𝑄$ and penalize when $𝑃 (𝑥)$ is small, $𝑄$ tends to concentrate on a single mode of $𝑃$ (under-dispersed but sharper). If $𝑄 (𝑥) > 0$ but $𝑃 (𝑥) \approx 0$ , the penalty is large.
Reduces exposure bias ¹
Typical use: Variational inference, policy optimization in RL
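The mode-covering and mode-seeking behavior described above can be illustrated numerically. The sketch below is only a toy setup and every choice in it is an assumption: a bimodal target on a grid, a single-Gaussian family for $𝑄$ , and a brute-force search over mean and standard deviation. The helper names `kl`, `gaussian_pmf` and `best_fit` are hypothetical.

```python
import numpy as np

def kl(p, q):
    """Discrete KL(p || q) in nats; entries with p == 0 contribute 0."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def gaussian_pmf(x, mu, sigma):
    """Gaussian density evaluated on the grid x, renormalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

# Toy bimodal target P: equal mixture of two narrow Gaussians at -3 and +3.
x = np.linspace(-8.0, 8.0, 801)
p = 0.5 * gaussian_pmf(x, -3.0, 0.5) + 0.5 * gaussian_pmf(x, 3.0, 0.5)

# Candidate single-Gaussian approximations Q (brute-force grid search).
mus = np.linspace(-4.0, 4.0, 33)
sigmas = np.linspace(0.4, 4.0, 37)

def best_fit(direction):
    """Return (divergence, mu, sigma) minimizing forward or reverse KL."""
    best = None
    for mu in mus:
        for sigma in sigmas:
            q = gaussian_pmf(x, mu, sigma)
            d = kl(p, q) if direction == "forward" else kl(q, p)
            if best is None or d < best[0]:
                best = (d, float(mu), float(sigma))
    return best

fwd = best_fit("forward")  # expected tendency: centered between modes, wide
rev = best_fit("reverse")  # expected tendency: sits on one mode, narrow
print("forward KL fit: mu=%.2f sigma=%.2f" % (fwd[1], fwd[2]))
print("reverse KL fit: mu=%.2f sigma=%.2f" % (rev[1], rev[2]))
```

In this setup the forward-KL fit spreads over both modes, while the reverse-KL fit locks onto one of them. Which of the two modes it picks is arbitrary because the target is symmetric.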

1.1. Jensen-Shannon Divergence

$JSD (𝑃 ‖ 𝑄) = \frac{1}{2} (KL (𝑃 ‖ 𝑀) + KL (𝑄 ‖ 𝑀)), \quad 𝑀 = \frac{1}{2} (𝑃 + 𝑄)$

JSD measures divergence relative to the mixture distribution $𝑀$ , which makes it:

Symmetric: $JSD (𝑃 ‖ 𝑄) = JSD (𝑄 ‖ 𝑃)$ (unlike KL)
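Bounded: $0 \leq JSD (𝑃 ‖ 𝑄) \leq \log 2$ with natural logarithms (equivalently $\leq 1$ with base-2 logarithms); the upper bound is attained when the distributions have disjoint support
Nearly a metric: JSD itself is not a metric, but $\sqrt{JSD (𝑃 ‖ 𝑄)}$ (the Jensen-Shannon distance) is one, since it also satisfies the triangle inequality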
Balances between mode-covering and mode-seeking behavior of KL variants
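The symmetry and boundedness properties listed above are easy to check on small discrete distributions. This is a minimal sketch using natural logarithms; the example vectors and the helper names `kl` and `jsd` are made up for illustration.

```python
import numpy as np

def kl(p, q):
    """Discrete KL(p || q) in nats; entries with p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = 0.5 * (p + q)
    # m > 0 wherever p > 0 or q > 0, so both KL terms are finite.
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])   # disjoint support from p
r = np.array([0.4, 0.3, 0.2, 0.1])

print(jsd(p, q), np.log(2))   # disjoint supports reach the upper bound log 2
print(jsd(p, r), jsd(r, p))   # symmetric in its arguments
print(jsd(p, p))              # zero for identical distributions
```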

1.2. Wasserstein Distance

For a map $𝑇$ and $𝑥 \sim 𝑃$ , the distribution of $𝑇 (𝑥)$ is called the push-forward of $𝑃$ , denoted by $𝑇_{\#} 𝑃$ and defined by $𝑇_{\#} 𝑃 (𝐴) = 𝑃 (\{𝑥 : 𝑇 (𝑥) \in 𝐴\}) = 𝑃 (𝑇^{- 1} (𝐴))$

The Monge version of the optimal transport distance is $\inf_{𝑇} \int {‖ 𝑥 - 𝑇 (𝑥) ‖}^{𝑝} 𝑑 𝑃 (𝑥)$ where the infimum is over all $𝑇$ such that $𝑇_{\#} 𝑃 = 𝑄$ . Intuitively, this measures how far you have to move the mass of $𝑃$ to turn it into $𝑄$ . A minimizer $𝑇^{*}$ , if one exists, is called the optimal transport map.

Let $Π (𝑃, 𝑄)$ denote the set of all joint distributions $𝛾$ for $(𝑋, 𝑌)$ that have marginals $𝑃$ and $𝑄$ . In other words, $𝑇_{𝑋 \#} 𝛾 = 𝑃$ and $𝑇_{𝑌 \#} 𝛾 = 𝑄$ where $𝑇_{𝑋} (𝑥, 𝑦) = 𝑥$ and $𝑇_{𝑌} (𝑥, 𝑦) = 𝑦$ . Then the Wasserstein distance is

$𝑊_{𝑝} (𝑃, 𝑄) = {\left( \inf_{𝛾 \in Π (𝑃, 𝑄)} \int {‖ 𝑥 - 𝑦 ‖}^{𝑝} 𝑑 𝛾 (𝑥, 𝑦) \right)}^{\frac{1}{𝑝}} = {\left( \inf_{𝛾 \in Π (𝑃, 𝑄)} 𝐸_{𝑥, 𝑦 \sim 𝛾} [{‖ 𝑥 - 𝑦 ‖}^{𝑝}] \right)}^{\frac{1}{𝑝}}$

where $𝑝 \geq 1$ . When $𝑝 = 1$ , this is also called the Earth Mover's Distance.
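In one dimension, with two samples of equal size, the infimum over couplings is attained by pairing the sorted samples, so the distance between the two empirical distributions reduces to a sort and an average. The sketch below covers only that special case; the function name `wasserstein_1d` is hypothetical, and general or higher-dimensional problems need an optimal transport solver.

```python
import numpy as np

def wasserstein_1d(xs, ys, p=1):
    """W_p between two equal-size 1D empirical distributions.

    Sorting both samples gives the optimal (monotone) coupling in 1D
    for p >= 1, so no search over couplings is needed.
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    if xs.shape != ys.shape:
        raise ValueError("this sketch assumes samples of equal size")
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

# A pure shift by 5 moves every unit of mass a distance of 5.
a = [0.0, 1.0, 3.0]
b = [8.0, 5.0, 6.0]   # same points shifted by 5, given in a different order
print(wasserstein_1d(a, b, p=1))
print(wasserstein_1d(a, b, p=2))
```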

It can be shown from Kantorovich-Rubinstein duality that

$𝑊_{𝑝}^{𝑝} (𝑃, 𝑄) = \sup_{𝜓, 𝜑} \left( \int 𝜓 (𝑦) 𝑑 𝑄 (𝑦) - \int 𝜑 (𝑥) 𝑑 𝑃 (𝑥) \right)$

where the supremum is over pairs $(𝜓, 𝜑)$ satisfying $𝜓 (𝑦) - 𝜑 (𝑥) \leq {‖ 𝑥 - 𝑦 ‖}^{𝑝}$ for all $𝑥, 𝑦$ . When $𝑝 = 1$ , we have

$𝑊_{1} (𝑃, 𝑄) = \sup_{{‖ 𝑇 ‖}_{𝐿} \leq 1} \left( 𝐸_{𝑥 \sim 𝑃} [𝑇 (𝑥)] - 𝐸_{𝑦 \sim 𝑄} [𝑇 (𝑦)] \right)$

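where ${‖ 𝑇 ‖}_{𝐿} \leq 1$ means $| 𝑇 (𝑥) - 𝑇 (𝑦) | \leq ‖ 𝑥 - 𝑦 ‖$ for all $𝑥, 𝑦$ , i.e. $𝑇$ is 1-Lipschitz. Here $𝑇$ denotes a real-valued test function, not a transport map.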

When to use Wasserstein Distance instead of KL:

Non-overlapping distributions: KL divergence becomes infinite (or undefined) when distributions have non-overlapping support, while WD remains finite and meaningful. This is critical in high-dimensional spaces where distributions rarely overlap perfectly.
Meaningful gradients: Even when distributions barely overlap, WD provides useful gradients for optimization. This is why Wasserstein GAN (WGAN) often trains more stably than a vanilla GAN: it can still learn when the generator distribution is far from the real data distribution.
True metric: WD is a proper distance metric (satisfies triangle inequality), making it more suitable for geometric interpretations and certain theoretical analyses.
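Weak topology: WD induces a weaker topology than KL. $𝑊_{𝑝} (𝑃_{𝑛}, 𝑃) \to 0$ is equivalent to convergence in distribution together with convergence of the $𝑝$-th moments, so a sequence can converge in WD even when its KL divergence to the limit stays infinite. This is often more natural for generative modeling.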
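The first two points above can be seen in a standard toy case: two point masses a distance $𝑘$ apart. The sketch below places them on an integer grid (an assumption made for simplicity) and uses the one-dimensional identity that $𝑊_{1}$ equals the integrated absolute difference of the two CDFs. The helper names `kl`, `jsd`, `w1_on_grid` and `point_mass` are hypothetical.

```python
import numpy as np

def kl(p, q):
    """Discrete KL(p || q) in nats; infinite if q == 0 where p > 0."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_on_grid(p, q, dx=1.0):
    """W_1 for distributions on a uniform 1D grid: sum of |CDF_p - CDF_q| * dx."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx)

def point_mass(k, n=6):
    """Distribution on grid points 0..n-1 with all mass at index k."""
    v = np.zeros(n)
    v[k] = 1.0
    return v

p = point_mass(0)
for k in range(1, 5):
    q = point_mass(k)
    # KL is infinite and JSD is stuck at log 2 for every k >= 1,
    # while W_1 grows with the separation k.
    print(k, kl(p, q), jsd(p, q), w1_on_grid(p, q))
```

KL and JSD carry no information about how far apart the two masses are, so they give no signal for moving one toward the other; $𝑊_{1}$ changes linearly with the separation.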

1.3. Fisher Divergence

$𝐹 (𝑃 ‖ 𝑄) = \frac{1}{2} 𝐸_{𝑥 \sim 𝑃} [{‖ \nabla_{𝑥} \log 𝑃 (𝑥) - \nabla_{𝑥} \log 𝑄 (𝑥) ‖}_{2}^{2}]$

Unlike KL divergence which compares probability values, Fisher divergence compares the score functions (gradients of log probabilities). This makes it particularly useful when:

Dealing with unnormalized distributions (only need score functions, not normalization constants)
Training score-based generative models
The score function is better behaved than the density itself
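A small Monte Carlo sketch shows that only score functions are needed. It assumes two one-dimensional Gaussians with the same standard deviation, chosen because the score difference is then constant and the divergence has the simple closed form $\frac{1}{2} (\mu_{𝑃} - \mu_{𝑄})^{2} / \sigma^{4}$ . The names `score_gaussian` and `fisher_divergence_mc` are hypothetical.

```python
import numpy as np

def score_gaussian(x, mu, sigma):
    """Score d/dx log N(x; mu, sigma^2). The normalizing constant drops out."""
    return -(x - mu) / sigma**2

def fisher_divergence_mc(score_p, score_q, samples):
    """Monte Carlo estimate of 0.5 * E_{x~P}[(score_p(x) - score_q(x))^2] in 1D."""
    diff = score_p(samples) - score_q(samples)
    return 0.5 * float(np.mean(diff**2))

rng = np.random.default_rng(0)
mu_p, mu_q, sigma = 0.0, 1.5, 2.0
samples = rng.normal(mu_p, sigma, size=10_000)   # draws from P

estimate = fisher_divergence_mc(
    lambda x: score_gaussian(x, mu_p, sigma),
    lambda x: score_gaussian(x, mu_q, sigma),
    samples,
)
closed_form = 0.5 * (mu_p - mu_q) ** 2 / sigma**4

print(estimate, closed_form)
```

With unequal variances the score difference depends on $𝑥$ , and the estimate would then match the exact value only up to Monte Carlo error.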

1.4. Applications

Variational Inference: Reverse KL (mode-seeking behavior prevents over-dispersed approximations)
GAN: JSD (symmetric, bounded measure between real and generated distributions)
WGAN: Wasserstein Distance (stable training with meaningful gradients)
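VAE: Forward KL in data space, through a lower bound on the maximum likelihood objective (mode-covering ensures all data modes are explained); the KL term inside the ELBO, taken from the approximate posterior to the prior, is a separate regularizer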
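RL (e.g., PPO, TRPO): KL between policies as a trust region or penalty. TRPO and PPO conventionally write it as $KL (\pi_{old} ‖ \pi_{\theta})$ ; KL-regularized objectives that penalize $KL (\pi_{\theta} ‖ \pi_{ref})$ are reverse KL and keep the policy from assigning probability to actions the reference policy considers unlikely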
Score-based models (diffusion): Fisher Divergence (training without normalized densities)