Latent Variable Models

June 8, 2025

by Leonardo

1. Use latent variable model

𝑝 (𝑥) = \int 𝑝 (𝑥 | 𝑧) 𝑝 (𝑧) 𝑑 𝑧

𝑝_{𝜃} (𝑥 | 𝑧) \to 𝑝 (𝑥 | 𝑧)

$𝑝 (𝑧)$ is the prior distribution, which is a predefined density function.
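To make the marginalization concrete, here is a minimal sketch that estimates $𝑝 (𝑥)$ by averaging $𝑝 (𝑥 | 𝑧)$ over samples from the prior. The toy model is an assumption chosen only because its marginal has a closed form: $𝑝 (𝑧) = 𝒩 (0, 1)$ and $𝑝 (𝑥 | 𝑧) = 𝒩 (𝑥 | 𝑎 𝑧, 𝑠^{2})$, with the hypothetical constants `a` and `s`.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def marginal_mc(x, a, s, n_samples, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by averaging over prior samples."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)            # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, a * z, s)   # p(x | z) = N(x | a z, s^2) (toy assumption)
    return total / n_samples


rng = random.Random(0)
a, s, x = 2.0, 0.5, 1.0
estimate = marginal_mc(x, a, s, 50_000, rng)
# For this linear-Gaussian toy model the marginal is N(0, a^2 + s^2).
exact = normal_pdf(x, 0.0, math.sqrt(a * a + s * s))
print(estimate, exact)
```

The estimate only needs samples from the prior and evaluations of $𝑝 (𝑥 | 𝑧)$, but its variance grows quickly in higher dimensions, which is one reason to optimize a bound instead of the integral itself.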

What we want is to learn $𝑝_{𝜃} (𝑥 | 𝑧)$ to approximate $𝑝 (𝑥 | 𝑧)$, where the mismatch is usually measured by the KL divergence. That divergence is hard to work with directly, so we match the joint distribution $𝑝 (𝑥, 𝑧)$ instead, using $𝑝 (𝑥, 𝑧) = 𝑝 (𝑥 | 𝑧) 𝑝 (𝑧)$.

\begin{aligned}
KL (𝑝 (𝑥, 𝑧) ‖ 𝑝_{𝜃} (𝑥, 𝑧)) & = \iint 𝑝 (𝑥, 𝑧) \log \frac{𝑝 (𝑥, 𝑧)}{𝑝_{𝜃} (𝑥, 𝑧)} 𝑑 𝑥 𝑑 𝑧 \\
& = \int 𝑝 (𝑥) [\int 𝑝 (𝑧 | 𝑥) \log \frac{𝑝 (𝑥) 𝑝 (𝑧 | 𝑥)}{𝑝_{𝜃} (𝑥, 𝑧)} 𝑑 𝑧] 𝑑 𝑥 \\
& = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\int 𝑝 (𝑧 | 𝑥) (\log 𝑝 (𝑥) + \log \frac{𝑝 (𝑧 | 𝑥)}{𝑝_{𝜃} (𝑥, 𝑧)}) 𝑑 𝑧] \\
& = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\log 𝑝 (𝑥) \int 𝑝 (𝑧 | 𝑥) 𝑑 𝑧] + 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\int 𝑝 (𝑧 | 𝑥) \log \frac{𝑝 (𝑧 | 𝑥)}{𝑝_{𝜃} (𝑥, 𝑧)} 𝑑 𝑧] \\
& = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\log 𝑝 (𝑥)] - 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\int 𝑝 (𝑧 | 𝑥) \log \frac{𝑝_{𝜃} (𝑥, 𝑧)}{𝑝 (𝑧 | 𝑥)} 𝑑 𝑧]
\end{aligned}

The first term does not depend on $𝜃$, so minimizing the KL divergence is equivalent to maximizing the expectation that is subtracted in the second term:

\begin{aligned}
ℒ & = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\int 𝑝 (𝑧 | 𝑥) \log \frac{𝑝_{𝜃} (𝑥, 𝑧)}{𝑝 (𝑧 | 𝑥)} 𝑑 𝑧] \\
& = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\int 𝑝 (𝑧 | 𝑥) \log \frac{𝑝_{𝜃} (𝑥 | 𝑧) 𝑝 (𝑧)}{𝑝 (𝑧 | 𝑥)} 𝑑 𝑧] \\
& = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\int 𝑝 (𝑧 | 𝑥) \log 𝑝_{𝜃} (𝑥 | 𝑧) 𝑑 𝑧 + \int 𝑝 (𝑧 | 𝑥) \log \frac{𝑝 (𝑧)}{𝑝 (𝑧 | 𝑥)} 𝑑 𝑧] \\
& = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [𝐸_{𝑧 \sim 𝑝 (𝑧 | 𝑥)} [\log 𝑝_{𝜃} (𝑥 | 𝑧)] - KL (𝑝 (𝑧 | 𝑥) ‖ 𝑝 (𝑧))]
\end{aligned}

This is the evidence lower bound (ELBO). Is maximizing the ELBO similar to doing maximum likelihood estimation (MLE)? Yes: factorizing the model joint as $𝑝_{𝜃} (𝑥, 𝑧) = 𝑝_{𝜃} (𝑥) 𝑝_{𝜃} (𝑧 | 𝑥)$, we can show that

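\begin{aligned}
𝐸_{𝑥 \sim 𝑝 (𝑥)} [ELBO (𝑥)] & = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\int 𝑝 (𝑧 | 𝑥) \log \frac{𝑝_{𝜃} (𝑥, 𝑧)}{𝑝 (𝑧 | 𝑥)} 𝑑 𝑧] \\
& = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\int 𝑝 (𝑧 | 𝑥) \log \frac{𝑝_{𝜃} (𝑥) 𝑝_{𝜃} (𝑧 | 𝑥)}{𝑝 (𝑧 | 𝑥)} 𝑑 𝑧] \\
& = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [\log 𝑝_{𝜃} (𝑥) - KL (𝑝 (𝑧 | 𝑥) ‖ 𝑝_{𝜃} (𝑧 | 𝑥))]
\end{aligned}

Since the KL term is nonnegative, the ELBO is a lower bound on the expected log-likelihood $𝐸_{𝑥 \sim 𝑝 (𝑥)} [\log 𝑝_{𝜃} (𝑥)]$, and the bound is tight when $𝑝_{𝜃} (𝑧 | 𝑥)$ matches $𝑝 (𝑧 | 𝑥)$.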
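The identity above can be checked numerically on a tiny example. The sketch below assumes a binary latent $𝑧$ and made-up probabilities for one fixed observation $𝑥$. The list `q` plays the role of the distribution written as $𝑝 (𝑧 | 𝑥)$ in the formula, and `posterior` is the model posterior $𝑝_{𝜃} (𝑧 | 𝑥)$.

```python
import math


def elbo(q, joint):
    """sum_z q(z) * log(p_theta(x, z) / q(z)) for a discrete latent z."""
    return sum(qz * math.log(pxz / qz) for qz, pxz in zip(q, joint) if qz > 0.0)


def kl(q, p):
    """KL(q || p) for two discrete distributions over the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0.0)


prior = [0.5, 0.5]        # p(z), illustrative numbers
likelihood = [0.2, 0.6]   # p_theta(x | z) at one fixed x, illustrative numbers
joint = [pz * lx for pz, lx in zip(prior, likelihood)]   # p_theta(x, z)
evidence = sum(joint)                                    # p_theta(x)
posterior = [j / evidence for j in joint]                # p_theta(z | x)

q = [0.5, 0.5]            # any distribution over z used inside the ELBO
gap = math.log(evidence) - elbo(q, joint)
print(gap, kl(q, posterior))
```

The two printed numbers agree, and setting `q = posterior` closes the gap so that the ELBO equals $\log 𝑝_{𝜃} (𝑥)$.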

1.1. VAE

We choose $𝑝 (𝑧)$ to be the standard normal $𝒩 (0, 𝐼)$, and use neural networks to approximate $𝑝 (𝑧 | 𝑥)$ and $𝑝 (𝑥 | 𝑧)$. The encoder approximates $𝑝 (𝑧 | 𝑥)$ with $𝑞_{𝜑} (𝑧 | 𝑥)$:

\begin{matrix} (𝜇, 𝜎^{2}) = {EncoderNetwork}_{𝜑} (𝑥), \\ 𝑞_{𝜑} (𝑧 | 𝑥) = 𝒩 (𝑧 | 𝜇, diag (𝜎^{2})) \end{matrix}
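Drawing a sample from $𝑞_{𝜑} (𝑧 | 𝑥)$ can be sketched as follows. The values of `mu` and `var` stand in for encoder outputs and are made up for illustration. The sample is written as $𝑧 = 𝜇 + 𝜎 𝜀$ with $𝜀 \sim 𝒩 (0, 𝐼)$, the usual reparametrized form for a diagonal Gaussian.

```python
import math
import random


def sample_latent(mu, var, rng):
    """Draw z ~ N(mu, diag(var)) as z = mu + sqrt(var) * eps, with eps ~ N(0, I)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


rng = random.Random(0)
mu = [1.0, -2.0]    # hypothetical encoder mean
var = [0.25, 4.0]   # hypothetical encoder variance (diagonal entries)

samples = [sample_latent(mu, var, rng) for _ in range(20_000)]
# Empirical mean per latent dimension; should be close to mu.
mean = [sum(s[j] for s in samples) / len(samples) for j in range(len(mu))]
print(mean)
```

Writing the sample this way keeps the randomness in $𝜀$, so $𝑧$ stays a differentiable function of $𝜇$ and $𝜎$.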

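We assume $𝑝 (𝑥 | 𝑧)$ is a deterministic mapping from $𝑧$ to $𝑥$, which would ideally be modeled as $𝑝_{𝜃} (𝑥 | 𝑧) = 𝛿 (𝑥 - 𝑓_{𝜃} (𝑧))$. The logarithm of a delta function cannot be used in the loss, so in practice we relax it to a Gaussian centered at $𝑓_{𝜃} (𝑧)$: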

\begin{matrix} 𝑓_{𝜃} (𝑧) = {DecoderNetwork}_{𝜃} (𝑧), \\ 𝑝_{𝜃} (𝑥 | 𝑧) = 𝒩 (𝑥 | 𝑓_{𝜃} (𝑧), 𝜎_{dec}^{2} 𝐼) \end{matrix}

where $𝜎_{dec}$ is a hyperparameter. Thus the first term of the loss function (the negative of the first ELBO term), written for a single dimension of $𝑥$, is

\begin{aligned}
𝐸_{𝑥 \sim 𝑝 (𝑥)} [𝐸_{𝑧 \sim 𝑞_{𝜑} (𝑧 | 𝑥)} [- \log 𝑝_{𝜃} (𝑥 | 𝑧)]] & = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [𝐸_{𝑧 \sim 𝑞_{𝜑} (𝑧 | 𝑥)} [- \log (\frac{1}{\sqrt{2 𝜋} 𝜎_{dec}} \exp (- \frac{{(𝑥 - 𝑓_{𝜃} (𝑧))}^{2}}{2 𝜎_{dec}^{2}}))]] \\
& = \frac{1}{2 𝜎_{dec}^{2}} 𝐸_{𝑥 \sim 𝑝 (𝑥)} [𝐸_{𝑧 \sim 𝑞_{𝜑} (𝑧 | 𝑥)} [{(𝑥 - 𝑓_{𝜃} (𝑧))}^{2}]] - \log \frac{1}{\sqrt{2 𝜋} 𝜎_{dec}}
\end{aligned}
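As a quick check of this step, the sketch below evaluates the Gaussian negative log-likelihood directly from the density and compares it with the scaled squared error plus constant. The numbers for `x`, `fx` (standing in for $𝑓_{𝜃} (𝑧)$) and `sigma_dec` are illustrative assumptions.

```python
import math


def gaussian_nll(x, fx, sigma_dec):
    """-log N(x | fx, sigma_dec^2), computed directly from the density."""
    norm = math.sqrt(2.0 * math.pi) * sigma_dec
    density = math.exp(-((x - fx) ** 2) / (2.0 * sigma_dec ** 2)) / norm
    return -math.log(density)


def scaled_squared_error(x, fx, sigma_dec):
    """Squared error / (2 sigma_dec^2) minus the log of the normalizing factor."""
    const = -math.log(1.0 / (math.sqrt(2.0 * math.pi) * sigma_dec))
    return (x - fx) ** 2 / (2.0 * sigma_dec ** 2) + const


x, fx, sigma_dec = 0.7, 0.4, 0.5   # illustrative values
nll = gaussian_nll(x, fx, sigma_dec)
mse_form = scaled_squared_error(x, fx, sigma_dec)
print(nll, mse_form)
```

Because the constant does not depend on $𝜃$ or $𝜑$, training with this term amounts to minimizing a squared reconstruction error whose weight is set by $𝜎_{dec}$.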

Since $KL (𝑁_{0} ‖ 𝑁_{1}) = \frac{1}{2} (tr (\Sigma_{1}^{- 1} \Sigma_{0}) + {(𝜇_{1} - 𝜇_{0})}^{𝑇} \Sigma_{1}^{- 1} (𝜇_{1} - 𝜇_{0}) + \log \frac{| \Sigma_{1} |}{| \Sigma_{0} |} - 𝑘)$, where $𝑘$ is the dimension of $𝑧$, the second term is

KL (𝑞_{𝜑} (𝑧 | 𝑥) ‖ 𝑝 (𝑧)) = \frac{1}{2} \sum_{𝑗 = 1}^{𝑘} (- \log 𝜎_{𝑗}^{2} + 𝜇_{𝑗}^{2} + 𝜎_{𝑗}^{2} - 1)
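The closed form can be cross-checked against the general Gaussian KL formula. The sketch below assumes diagonal covariances, sums the per-dimension expression over the latent dimensions, and uses made-up values for `mu` and `var`.

```python
import math


def kl_to_standard_normal(mu, var):
    """KL(N(mu, diag(var)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(-math.log(v) + m * m + v - 1.0 for m, v in zip(mu, var))


def kl_diag_gaussians(mu0, var0, mu1, var1):
    """General KL(N0 || N1) specialized to diagonal covariance matrices."""
    k = len(mu0)
    trace = sum(v0 / v1 for v0, v1 in zip(var0, var1))
    maha = sum((m1 - m0) ** 2 / v1 for m0, m1, v1 in zip(mu0, mu1, var1))
    log_det_ratio = sum(math.log(v1) - math.log(v0) for v0, v1 in zip(var0, var1))
    return 0.5 * (trace + maha + log_det_ratio - k)


mu, var = [0.5, -1.0], [0.8, 1.5]   # illustrative encoder outputs
closed = kl_to_standard_normal(mu, var)
general = kl_diag_gaussians(mu, var, [0.0, 0.0], [1.0, 1.0])
print(closed, general)
```

Both functions return the same value, and the KL is zero exactly when $𝜇 = 0$ and $𝜎^{2} = 1$ in every dimension.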

We are trying to find $(𝜑, 𝜃) = \arg \max_{𝜑, 𝜃} 𝐸_{𝑥 \sim 𝑝 (𝑥)} [ELBO (𝑥)]$ .

1.1.1. Conditional VAE (CVAE)

We define the CVAE objective $𝐿_{CVAE} = 𝐸_{(𝑥, 𝑦) \sim 𝑝 (𝑥, 𝑦)} [𝐸_{𝑧 \sim 𝑞_{𝜑} (𝑧 | 𝑥, 𝑦)} [\log 𝑝_{𝜃} (𝑦 | 𝑥, 𝑧)] - KL (𝑞_{𝜑} (𝑧 | 𝑥, 𝑦) ‖ 𝑝_{𝜃} (𝑧 | 𝑥))]$ and the Gaussian stochastic neural network (GSNN) objective $𝐿_{GSNN} = 𝐸_{(𝑥, 𝑦) \sim 𝑝 (𝑥, 𝑦)} [𝐸_{𝑧 \sim 𝑝_{𝜃} (𝑧 | 𝑥)} [\log 𝑝_{𝜃} (𝑦 | 𝑥, 𝑧)]]$, which samples $𝑧$ from the prior network, since $𝑦$ is not available at test time. The hybrid objective is $𝐿_{hybrid} = 𝛼 𝐿_{CVAE} + (1 - 𝛼) 𝐿_{GSNN}$.

1.1.2. $𝛽$ -VAE

$ℒ = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [𝐸_{𝑧 \sim 𝑞_{𝜑} (𝑧 | 𝑥)} [\log 𝑝_{𝜃} (𝑥 | 𝑧)] - 𝛽 KL (𝑞_{𝜑} (𝑧 | 𝑥) ‖ 𝑝 (𝑧))]$. When $𝛽 > 1$, the stronger KL penalty forces the dimensions of $𝑧 \sim 𝑞_{𝜑} (𝑧 | 𝑥)$ to be more independent (disentangled).

1.1.3. VAE with Discrete Latent

1.1.3.1. Gumbel-Softmax

Gumbel-Max is a way to sample from a categorical distribution. If the probability of category $𝑖$ is $𝑝_{𝑖}$, then $\arg \max_{𝑖} (\log 𝑝_{𝑖} - \log (- \log 𝜀_{𝑖}))$ with $𝜀_{𝑖} \sim 𝑈 [0, 1]$ is distributed according to that categorical distribution. Because all the randomness is moved into the noise $𝜀_{𝑖}$, this is a reparametrization trick.
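A minimal sketch of Gumbel-Max sampling follows, with an empirical check that the sampled frequencies approach the target probabilities. The probabilities in `probs` are made up, and the clamping of $𝜀_{𝑖}$ away from 0 and 1 is an added safeguard so the logarithms stay finite.

```python
import math
import random


def gumbel_noise(rng):
    """-log(-log(eps)) with eps ~ U(0, 1), clamped so both logs are defined."""
    eps = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(eps))


def gumbel_max_sample(probs, rng):
    """Return the index argmax_i (log p_i + Gumbel noise_i)."""
    scores = [math.log(p) + gumbel_noise(rng) for p in probs]
    return max(range(len(probs)), key=scores.__getitem__)


rng = random.Random(0)
probs = [0.1, 0.6, 0.3]   # illustrative category probabilities
n = 30_000
counts = [0] * len(probs)
for _ in range(n):
    counts[gumbel_max_sample(probs, rng)] += 1
freqs = [c / n for c in counts]
print(freqs)
```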

But $\arg \max$ is not differentiable, so we use softmax to approximate it:

softmax (\frac{\log 𝑝_{𝑖} - \log (- \log 𝜀_{𝑖})}{𝜏}), 𝜀_{𝑖} \sim 𝑈 [0, 1]

where $𝜏$ is a temperature parameter. The smaller $𝜏$ is, the closer the result is to a one-hot vector.
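The effect of the temperature can be seen in a small sketch. The same noise (same seed) is used at two temperatures, and the probabilities in `probs` are illustrative assumptions. The maximum is subtracted before exponentiating only for numerical stability.

```python
import math
import random


def gumbel_softmax(probs, tau, rng):
    """softmax((log p_i - log(-log eps_i)) / tau) with eps_i ~ U(0, 1)."""
    logits = []
    for p in probs:
        eps = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        logits.append((math.log(p) - math.log(-math.log(eps))) / tau)
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]            # stable softmax
    total = sum(exps)
    return [e / total for e in exps]


probs = [0.1, 0.6, 0.3]                                   # illustrative
soft = gumbel_softmax(probs, 5.0, random.Random(0))       # high temperature
sharp = gumbel_softmax(probs, 0.1, random.Random(0))      # low temperature, same noise
print(soft)
print(sharp)
```

Both outputs sum to one and put their largest weight on the same category, but the low-temperature output concentrates more mass on it.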

Using Gumbel-Softmax, we can use a discrete uniform prior over $𝑘$ categories, $𝑝 (𝑧) = uniform (0, 𝑘 - 1)$, instead of $𝑝 (𝑧) = 𝒩 (0, 1)$.

1.1.3.2. Vector-Quantization VAE (VQ-VAE)

Reduce dimensions and use PixelCNN to generate images.

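In practice, we encode $𝑥$ into an $𝑚 \times 𝑚$ grid of $𝑑$-dimensional vectors $𝑧$, and replace each vector with its nearest codebook entry $𝑧_{𝑞} = 𝑒_{𝑘^{*}}$, where $𝑘^{*} = \arg \min_{𝑘} {‖ 𝑧 - 𝑒_{𝑘} ‖}_{2}$ and $𝑒_{𝑘}$ are the codebook vectors. But $\arg \min$ is not differentiable, so we use the straight-through estimator to define our own gradient and change the loss function: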

{‖ 𝑥 - decoder (𝑧_{𝑞}) ‖}_{2}^{2} \to {‖ 𝑥 - decoder (𝑧 + sg [𝑧_{𝑞} - 𝑧]) ‖}_{2}^{2}
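The sketch below illustrates both steps for a single latent vector: the nearest-codebook lookup (the $\arg \min$) and the straight-through trick. It is a toy under stated assumptions: the codebook, the linear `decoder` with weights `w`, and the scalar target `x` are all hypothetical, and the stop-gradient is imitated by computing the offset $𝑧_{𝑞} - 𝑧$ once and then holding it fixed while $𝑧$ is perturbed.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(z, codebook):
    """Nearest codebook entry to z (the non-differentiable arg min)."""
    index = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return codebook[index]


def decoder(u, w):
    """Hypothetical linear decoder, used only for this illustration."""
    return sum(wi * ui for wi, ui in zip(w, u))


def recon_loss(z, offset, x, w):
    """(x - decoder(z + sg[z_q - z]))^2, where offset is the frozen sg[...] term."""
    u = [zi + oi for zi, oi in zip(z, offset)]
    return (x - decoder(u, w)) ** 2


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
z = [0.9, 1.2]
z_q = quantize(z, codebook)
offset = [q - zi for q, zi in zip(z_q, z)]   # computed once, then treated as constant
w, x = [1.5, -0.5], 2.0

loss = recon_loss(z, offset, x, w)           # forward value uses z_q
h = 1e-6                                     # central finite difference in z[0]
grad_z0 = (recon_loss([z[0] + h, z[1]], offset, x, w)
           - recon_loss([z[0] - h, z[1]], offset, x, w)) / (2 * h)
print(z_q, loss, grad_z0)
```

The forward pass sees $𝑧_{𝑞}$, while the gradient with respect to $𝑧$ equals the gradient the decoder would pass to its input, which is the behaviour the straight-through estimator is meant to give.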

To make $𝑧_{𝑞}$ more similar to $𝑧$, we can add ${‖ 𝑧 - 𝑧_{𝑞} ‖}_{2}^{2}$ to the loss function. Using the stop-gradient operator, we split ${‖ 𝑧_{𝑞} - 𝑧 ‖}_{2}^{2}$ into ${‖ sg [𝑧] - 𝑧_{𝑞} ‖}_{2}^{2} + {‖ 𝑧 - sg [𝑧_{𝑞}] ‖}_{2}^{2}$. The first term fixes $𝑧$ and moves $𝑧_{𝑞}$ closer to $𝑧$, and the second term fixes $𝑧_{𝑞}$ and moves $𝑧$ closer to $𝑧_{𝑞}$. Since $𝑧_{𝑞}$ is freer to change, we weight the two terms differently, and the loss function is:

{‖ 𝑥 - decoder (𝑧 + sg [𝑧_{𝑞} - 𝑧]) ‖}_{2}^{2} + 𝛽 {‖ sg [𝑧] - 𝑧_{𝑞} ‖}_{2}^{2} + 𝛾 {‖ 𝑧 - sg [𝑧_{𝑞}] ‖}_{2}^{2}

where $𝛾 < 𝛽$. After training, we can fit an autoregressive model such as PixelCNN to the discrete codes and use it as the prior $𝑝 (𝑧)$ for better sampling.

1.1.3.3. VQ-VAE 2

Bi-level VQ-VAE, bottom level conditions on top level.

1.1.3.4. DALL-E

Discrete VAE (dVAE) using a ResNet, with a codebook of size 8192 and 1024 image tokens per image, plus a Transformer prior trained on large-scale text-image paired data.

1.1.3.5. DALL-E 2/3

Image generation model over image embeddings.

1.1.3.6. Latent Diffusion Models (LDM)

Diffusion model trained in the latent space of a pretrained autoencoder.
