Implicit Generative Models

June 7, 2025

by Leonardo

1. Use implicit generative models

We define a prior distribution $𝑝_{𝑧} (𝑧)$ and a generator $𝐺 (𝑧)$ that maps latent samples to data space. The induced distribution over samples is $𝑝_{𝑔} (𝑥) = \int 𝑝_{𝑧} (𝑧) 𝛿 (𝑥 - 𝐺 (𝑧)) 𝑑 𝑧$, which we can sample from but generally cannot evaluate in closed form. If we view $𝑞 (𝑧) = 𝑝_{𝑧} (𝑧)$ and $𝑞 (𝑥 | 𝑧) = 𝛿 (𝑥 - 𝐺 (𝑧))$, then we are doing variational inference.
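To make "implicit" concrete, here is a minimal sketch of sampling from $𝑝_{𝑔}$ without ever writing down its density. The generator $𝐺 (𝑧) = 2 𝑧 + 1$ and the standard normal prior are assumptions chosen only for illustration; they are not from any particular model.

```python
import random
import statistics


def generator(z):
    # Hypothetical generator G(z) = 2z + 1; any deterministic map would do.
    return 2.0 * z + 1.0


def sample_implicit(n, seed=0):
    rng = random.Random(seed)
    # Draw z ~ p_z = N(0, 1), then push it through G to obtain x ~ p_g.
    return [generator(rng.gauss(0.0, 1.0)) for _ in range(n)]


samples = sample_implicit(20000)
sample_mean = statistics.fmean(samples)   # close to 1 for this G
sample_std = statistics.stdev(samples)    # close to 2 for this G
```

We only ever touch $𝑝_{𝑔}$ through its samples, which is why training needs a sample-based measure of similarity.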

1.1. GAN

We want a metric that measures the similarity between $𝑝_{𝑔} (𝑥)$ and $𝑝_{data} (𝑥)$. A GAN uses a discriminator for this.

Generator $𝐺$ : Maps random noise $𝑧$ to data-like samples $𝐺 (𝑧)$
Discriminator $𝐷$ : Classifies samples as real (from data) or fake (from generator)

1.1.1. The Discriminator's Goal

The discriminator wants to:

Output high probability for real data: maximize $\log 𝐷 (𝑥)$ for $𝑥 \sim 𝑝_{data} (𝑥)$
Output low probability for fake data: maximize $\log (1 - 𝐷 (𝐺 (𝑧)))$ for $𝑧 \sim 𝑝_{𝑧} (𝑧)$

Combined objective for discriminator:

\max_{𝐷} 𝐸_{𝑥 \sim 𝑝_{data} (𝑥)} [\log 𝐷 (𝑥)] + 𝐸_{𝑧 \sim 𝑝_{𝑧} (𝑧)} [\log (1 - 𝐷 (𝐺 (𝑧)))]

1.1.2. The Generator's Goal

The generator wants to fool the discriminator:

Make $𝐷 (𝐺 (𝑧))$ as close to 1 as possible
Minimize $\log (1 - 𝐷 (𝐺 (𝑧)))$

Generator objective:

\min_{𝐺} 𝐸_{𝑧 \sim 𝑝_{𝑧} (𝑧)} [\log (1 - 𝐷 (𝐺 (𝑧)))]

Combining both objectives, we get the famous minimax game:

\min_{𝐺} \max_{𝐷} 𝑉 (𝐺, 𝐷) = 𝐸_{𝑥 \sim 𝑝_{data} (𝑥)} [\log 𝐷 (𝑥)] + 𝐸_{𝑧 \sim 𝑝_{𝑧} (𝑧)} [\log (1 - 𝐷 (𝐺 (𝑧)))]

We can't solve the minimax problem directly, so we alternate:

Fix G, train D: Given current generator, train discriminator to distinguish real from fake
Fix D, train G: Given current discriminator, train generator to fool it
Repeat: This creates an arms race that drives both models to improve
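The alternating scheme above can be sketched on a toy 1-D problem. Everything here is an assumption made for illustration: the data distribution $𝒩 (3, 1)$, the shift-only generator $𝐺 (𝑧) = 𝑧 + 𝑏$, the logistic discriminator $𝐷 (𝑥) = 𝜎 (𝑤 𝑥 + 𝑐)$, and the step sizes. Gradients are written out by hand so the two update directions are visible.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train(rounds, batch=256, lr_d=0.1, lr_g=0.1, seed=0):
    rng = random.Random(seed)
    w, c = 0.0, 0.0   # discriminator D(x) = sigmoid(w * x + c)
    b = 0.0           # generator G(z) = z + b
    for _ in range(rounds):
        real = [rng.gauss(3.0, 1.0) for _ in range(batch)]      # x ~ p_data (assumed)
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]  # G(z), z ~ N(0, 1)

        # Fix G, train D: one gradient-ascent step on V(G, D).
        d_real = [sigmoid(w * x + c) for x in real]
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_w = (sum((1 - d) * x for d, x in zip(d_real, real))
                  - sum(d * x for d, x in zip(d_fake, fake))) / batch
        grad_c = (sum(1 - d for d in d_real) - sum(d_fake)) / batch
        w += lr_d * grad_w
        c += lr_d * grad_c

        # Fix D, train G: one gradient-descent step on E[log(1 - D(G(z)))].
        # d/db log(1 - sigmoid(w * (z + b) + c)) = -sigmoid(.) * w
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_b = -w * sum(d_fake) / batch
        b -= lr_g * grad_b
    return w, c, b


w1, c1, b1 = train(rounds=1)
```

After a single round the discriminator has learned that larger $𝑥$ looks more "real" ($𝑤 > 0$), and the generator shift $𝑏$ has moved toward the data. Longer runs of this kind of loop may oscillate rather than settle, so this is a sketch of the update pattern, not a convergence guarantee.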

Question: What happens at the optimal solution?

At equilibrium, the generator should recover the data distribution: $𝑝_{𝑔} = 𝑝_{data}$.

Proof sketch:

For fixed $𝐺$ , the optimal discriminator is:
$𝐷_{𝐺}^{*} (𝑥) = \frac{𝑝_{data} (𝑥)}{𝑝_{data} (𝑥) + 𝑝_{𝑔} (𝑥)}$
Substituting this back into the objective:
$𝐶 (𝐺) = \max_{𝐷} 𝑉 (𝐺, 𝐷) = 𝐸_{𝑥 \sim 𝑝_{data}} [\log \frac{𝑝_{data} (𝑥)}{𝑝_{data} (𝑥) + 𝑝_{𝑔} (𝑥)}] + 𝐸_{𝑥 \sim 𝑝_{𝑔}} [\log \frac{𝑝_{𝑔} (𝑥)}{𝑝_{data} (𝑥) + 𝑝_{𝑔} (𝑥)}]$
This can be rewritten as:
$𝐶 (𝐺) = - \log (4) + 2 \cdot JSD (𝑝_{data} ‖ 𝑝_{𝑔})$
where JSD is the Jensen-Shannon divergence.
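Since $JSD \geq 0$, with equality if and only if the two distributions coincide, the global minimum $𝐶 (𝐺) = - \log (4)$ is achieved exactly when $𝑝_{𝑔} = 𝑝_{data}$. At that point $𝐷_{𝐺}^{*} (𝑥) = \frac{1}{2}$ everywhere.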
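The proof sketch can be checked numerically on a small discrete example. The two distributions below are arbitrary assumptions; the code plugs the optimal discriminator into $𝑉 (𝐺, 𝐷)$ and compares the result with $- \log (4) + 2 \cdot JSD$.

```python
import math


def value(p_data, p_g, d):
    # V(G, D) for discrete distributions; d[i] is the discriminator output D(x_i).
    return sum(pd * math.log(di) + pg * math.log(1 - di)
               for pd, pg, di in zip(p_data, p_g, d))


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p_data = [0.5, 0.3, 0.2]   # assumed toy data distribution
p_g = [0.2, 0.2, 0.6]      # assumed toy generator distribution

# Optimal discriminator for fixed G: D*(x) = p_data(x) / (p_data(x) + p_g(x)).
d_star = [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]
c_g = value(p_data, p_g, d_star)
identity_rhs = -math.log(4) + 2 * jsd(p_data, p_g)
```

Here `c_g` and `identity_rhs` agree up to floating-point error, and any other choice of `d` gives a smaller value than `d_star`.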

1.1.3. Evaluation of GAN

Inception Score (IS)

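$𝑝 (𝑦 | 𝑥) = Inception-v3 (𝑥), 𝑝 (𝑦) = 𝐸_{𝑥} [𝑝 (𝑦 | 𝑥)], IS = \exp (𝐸_{𝑥} [KL (𝑝 (𝑦 | 𝑥) ‖ 𝑝 (𝑦))])$, where the expectations are taken over generated samples $𝑥 \sim 𝑝_{𝑔}$. Higher is better.

IS rewards samples that are individually easy to classify (a confident $𝑝 (𝑦 | 𝑥)$) and collectively spread across classes (a high-entropy $𝑝 (𝑦)$), but it never compares the generated distribution with $𝑝_{data}$. A model that merely memorizes the training data also achieves a high IS.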
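As a sketch of the formula, the function below computes IS from a table of class probabilities. It assumes the classifier outputs $𝑝 (𝑦 | 𝑥)$ are already available as rows of a list; no actual Inception-v3 network is involved, and the three toy inputs are made up.

```python
import math


def inception_score(probs):
    # probs[i][k] = p(y = k | x_i); each row is assumed to sum to 1.
    n = len(probs)
    k = len(probs[0])
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]  # p(y)
    mean_kl = sum(
        sum(p * math.log(p / marginal[j]) for j, p in enumerate(row) if p > 0)
        for row in probs
    ) / n
    return math.exp(mean_kl)


# Confident and diverse: each sample is a different class.
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Confident but collapsed: every sample is the same class.
collapsed = [[1.0, 0.0, 0.0]] * 3
# Diverse on average but individually uncertain.
uncertain = [[1 / 3, 1 / 3, 1 / 3]] * 3
```

With three classes, `diverse` reaches the maximum score of 3, while `collapsed` and `uncertain` both score 1. None of these numbers depends on real data, which is the limitation noted above.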

Fréchet Inception Distance (FID)

FID measures the distance between real and generated image distributions in Inception feature space.

For both real images $𝑋_{𝑟} = \{𝑥_{1}^{𝑟}, 𝑥_{2}^{𝑟}, \dots, 𝑥_{𝑁}^{𝑟}\}$ and generated images $𝑋_{𝑔} = \{𝑥_{1}^{𝑔}, 𝑥_{2}^{𝑔}, \dots, 𝑥_{𝑀}^{𝑔}\}$:

𝐹_{𝑟} = Inception (𝑋_{𝑟}) \in ℝ^{𝑁 \times 2048}, 𝐹_{𝑔} = Inception (𝑋_{𝑔}) \in ℝ^{𝑀 \times 2048}

Assume features follow multivariate Gaussian distributions:

𝐹_{𝑟} \sim 𝒩 (𝜇_{𝑟}, Σ_{𝑟}) and 𝐹_{𝑔} \sim 𝒩 (𝜇_{𝑔}, Σ_{𝑔})

Compute sample statistics:

𝜇_{𝑟} = \frac{1}{𝑁} \sum_{𝑖 = 1}^{𝑁} 𝑓_{𝑖}^{𝑟}, Σ_{𝑟} = Cov (𝐹_{𝑟}), 𝜇_{𝑔} = \frac{1}{𝑀} \sum_{𝑗 = 1}^{𝑀} 𝑓_{𝑗}^{𝑔}, Σ_{𝑔} = Cov (𝐹_{𝑔})

The distance between two multivariate Gaussians:

FID = ‖ 𝜇_{𝑟} - 𝜇_{𝑔} ‖^{2} + Tr (Σ_{𝑟} + Σ_{𝑔} - 2 {(Σ_{𝑟} Σ_{𝑔})}^{\frac{1}{2}})

Component	Meaning
$‖ 𝜇_{𝑟} - 𝜇_{𝑔} ‖^{2}$	Difference in central tendency (Are generated images "on average" similar to real ones?)
$Tr (Σ_{𝑟} + Σ_{𝑔} - 2 {(Σ_{𝑟} Σ_{𝑔})}^{\frac{1}{2}})$	Difference in variation patterns (Do generated images have similar diversity and correlations?)

Lower is better: FID = 0 when $𝜇_{𝑟} = 𝜇_{𝑔}$ and $Σ_{𝑟} = Σ_{𝑔}$. FID values are dataset-dependent, so compare them only within the same dataset and experimental setup.
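To see the two terms at work, here is a deliberately simplified sketch that assumes diagonal covariances. In that special case ${(Σ_{𝑟} Σ_{𝑔})}^{\frac{1}{2}}$ is diagonal too, and the trace term reduces to a per-dimension sum. Real FID uses the full $2048 \times 2048$ covariance and a matrix square root; the tiny feature vectors below are invented for illustration.

```python
import math
import statistics


def fid_diagonal(feats_r, feats_g):
    # feats_*: one feature vector (list of floats) per image.
    # Assumption: covariances are diagonal, so each dimension is handled alone.
    total = 0.0
    for d in range(len(feats_r[0])):
        col_r = [f[d] for f in feats_r]
        col_g = [f[d] for f in feats_g]
        mu_r, mu_g = statistics.fmean(col_r), statistics.fmean(col_g)
        var_r, var_g = statistics.variance(col_r), statistics.variance(col_g)
        # Mean term plus the diagonal version of Tr(S_r + S_g - 2 (S_r S_g)^(1/2)).
        total += (mu_r - mu_g) ** 2 + var_r + var_g - 2 * math.sqrt(var_r * var_g)
    return total


real_feats = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0], [3.0, 5.0]]
# Same spread, but every feature vector shifted by (1, 2).
shifted_feats = [[a + 1.0, b + 2.0] for a, b in real_feats]
```

Identical feature sets give 0. The shifted set leaves the variances unchanged, so only the mean term contributes: $1^{2} + 2^{2} = 5$.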

1.1.4. Deep Convolutional GANs (DCGAN)

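Replace pooling and fully connected hidden layers with strided convolutions (transposed convolutions in the generator)
Use batch normalization to stabilize training
Use LeakyReLU rather than ReLU in the discriminator so that gradients do not vanish for negative pre-activations
Use a small learning rate and a reduced momentum term (the DCGAN paper uses Adam with learning rate 0.0002 and $𝛽_{1} = 0.5$)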

1.1.5. Improved Training Techniques for GAN

Feature Matching
Minibatch discrimination
Historical averaging
One-sided label smoothing
Virtual batch normalization
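Of the techniques listed, one-sided label smoothing is the simplest to show in code. The sketch below assumes a smoothed real target of 0.9, a commonly used value; only the real labels are softened, while fake labels stay at 0. The function names are hypothetical.

```python
import math


def bce(d, target):
    # Binary cross-entropy for a single discriminator output d in (0, 1).
    return -(target * math.log(d) + (1 - target) * math.log(1 - d))


def discriminator_loss(d_real, d_fake, real_target=0.9):
    # One-sided smoothing: real targets become real_target, fake targets stay 0.
    loss_real = sum(bce(d, real_target) for d in d_real) / len(d_real)
    loss_fake = sum(bce(d, 0.0) for d in d_fake) / len(d_fake)
    return loss_real + loss_fake


# With a 0.9 target, an overconfident output of 0.99 on real data costs more
# than an output of 0.9, so the discriminator is discouraged from saturating.
smoothed_at_090 = bce(0.90, 0.9)
smoothed_at_099 = bce(0.99, 0.9)
```

With hard labels (`target=1.0`) the loss keeps shrinking as the output approaches 1; with the smoothed target, the per-sample loss is smallest at 0.9.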

1.1.6. Wasserstein GANs (WGANs)

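The JS divergence saturates at $\log 2$ whenever $𝑝_{data}$ and $𝑝_{𝑔}$ have disjoint supports, so the generator receives no useful gradient. This makes training unstable and can contribute to mode collapse. The Wasserstein distance still varies with how far apart the two distributions are, so we use it instead.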

\min_{𝐺} \max_{𝐷, {‖ 𝐷 ‖}_{𝐿} \leq 1} 𝐸_{𝑥 \sim 𝑝_{data}} [𝐷 (𝑥)] - 𝐸_{𝑥 \sim 𝑝_{𝑔}} [𝐷 (𝑥)]
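A small example shows why the switch of distance matters. Take two point masses on a 1-D grid, an assumed toy setup: the "data" sits at position 0 and the "generator" at position $𝑘$. The sketch computes both distances as $𝑘$ varies.

```python
import math


def jsd(p, q):
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q, spacing=1.0):
    # On an evenly spaced 1-D grid, W1 = spacing * sum |CDF_p - CDF_q|.
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return spacing * total


def point_mass(index, size=10):
    return [1.0 if i == index else 0.0 for i in range(size)]


p_data = point_mass(0)
jsd_by_shift = [jsd(p_data, point_mass(k)) for k in range(1, 10)]
w1_by_shift = [wasserstein_1d(p_data, point_mass(k)) for k in range(1, 10)]
```

Every entry of `jsd_by_shift` equals $\log 2$ no matter how far apart the masses are, so it gives the generator no direction to move in. The entries of `w1_by_shift` grow as 1, 2, ..., 9, which tells the generator exactly how far it still has to go.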

But how do we enforce the Lipschitz constraint?

Weight Clipping (original WGAN)

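After every update, clip each weight of the critic to a fixed range $[- 𝑐, 𝑐]$ (the original paper uses $𝑐 = 0.01$). If each layer of the network has bounded weights, and we use activation functions that are Lipschitz (like ReLU or tanh), then the composition of these functions is Lipschitz as well, although with a constant that depends on $𝑐$ and the architecture rather than exactly 1.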
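A minimal sketch of the clipping step follows, with weights stored as plain nested lists (an assumption made to keep the example dependency-free). It also includes a crude upper bound on the Lipschitz constant, the product of per-layer Frobenius norms, which holds when the activations are 1-Lipschitz. Both function names are hypothetical.

```python
import math


def clip_weights(layers, c=0.01):
    # layers: list of weight matrices, each a list of rows.
    # Returns a clipped copy with every entry forced into [-c, c].
    return [[[max(-c, min(c, w)) for w in row] for row in matrix]
            for matrix in layers]


def lipschitz_upper_bound(layers):
    # Frobenius norm >= spectral norm, so the product of Frobenius norms
    # upper-bounds the Lipschitz constant for 1-Lipschitz activations.
    bound = 1.0
    for matrix in layers:
        bound *= math.sqrt(sum(w * w for row in matrix for w in row))
    return bound


layers = [[[0.5, -0.002], [-3.0, 0.01]], [[0.2, -0.7]]]
clipped = clip_weights(layers)
bound_before = lipschitz_upper_bound(layers)
bound_after = lipschitz_upper_bound(clipped)
```

Clipping guarantees some finite bound, but it is a blunt tool: the bound depends on $𝑐$ and the layer sizes, and it limits what the critic can represent.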

Gradient Penalty (WGAN-GP)

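The critic $𝐷$ minimizes

ℒ = 𝐸_{𝑥 \sim 𝑝_{𝑔}} [𝐷 (𝑥)] - 𝐸_{𝑥 \sim 𝑝_{data}} [𝐷 (𝑥)] + 𝜆 \cdot 𝐸_{\hat{𝑥} \sim 𝑟} [{(‖ \nabla_{\hat{𝑥}} 𝐷 (\hat{𝑥}) ‖ - 1)}^{2}]

which is the negative of the Wasserstein objective above plus a penalty that pushes the gradient norm toward 1. Here $𝑟$ is the distribution of interpolated points $\hat{𝑥} = (1 - 𝜀) 𝑥_{data} + 𝜀 𝑥_{𝑔}$ with $𝑥_{data} \sim 𝑝_{data}$, $𝑥_{𝑔} \sim 𝑝_{𝑔}$ and $𝜀 \sim 𝑈 [0, 1]$.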
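The penalty term can be sketched without a deep learning framework by estimating the gradient with central finite differences. In practice it would come from automatic differentiation; the two linear critics below are assumptions chosen so the expected penalty is easy to verify by hand.

```python
import math
import random


def grad_norm(critic, x, h=1e-5):
    # Euclidean norm of the critic's gradient at x, via central differences.
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grads.append((critic(up) - critic(down)) / (2 * h))
    return math.sqrt(sum(g * g for g in grads))


def gradient_penalty(critic, real, fake, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for x_data, x_g in zip(real, fake):
        eps = rng.random()  # eps ~ U[0, 1)
        # Interpolate between a real and a generated sample.
        x_hat = [(1 - eps) * a + eps * b for a, b in zip(x_data, x_g)]
        total += (grad_norm(critic, x_hat) - 1.0) ** 2
    return total / len(real)


real = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
fake = [[0.0, 0.0], [1.5, 1.0], [-1.0, 0.5]]


def unit_critic(x):
    return 0.6 * x[0] + 0.8 * x[1]   # gradient norm is exactly 1


def steep_critic(x):
    return 3.0 * x[0]                # gradient norm is 3


penalty_unit = gradient_penalty(unit_critic, real, fake)
penalty_steep = gradient_penalty(steep_critic, real, fake)
```

The unit-norm critic incurs essentially no penalty, while the steeper one is penalized by about ${(3 - 1)}^{2} = 4$. Multiplying by $𝜆$ and adding the result to the critic loss gives the WGAN-GP objective.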

1.1.7. BigGAN, GigaGAN, R3GAN, BiGAN, BigBiGAN

BigGAN: scaling up for quality, class-conditional batch normalization
GigaGAN: text-to-image at scale
R3GAN: a modernized baseline built on a relativistic GAN loss with R1 and R2 gradient penalties
BiGAN: Bidirectional GAN
BigBiGAN: Scaling Bidirectional Learning
