Implicit Generative Models
1. Use implicit generative models
We define a prior distribution $p(z)$ and a generative model $G$ to generate samples. Thus we have $x = G(z)$ with $z \sim p(z)$, which implicitly defines a model distribution $p_G(x)$. If we view and , then we are doing variational inference.
1.1. GAN
We want some metric that measures the similarity between $p_{\text{data}}$ and $p_G$. A GAN uses a discriminator to measure this similarity.
- Generator $G$: Maps random noise $z \sim p(z)$ to data-like samples $G(z)$
- Discriminator $D$: Classifies samples as real (from data) or fake (from generator); $D(x)$ is the predicted probability that $x$ is real
1.1.1. The Discriminator's Goal
The discriminator wants to:
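- Output high probability for real data: maximize $\log D(x)$ for $x \sim p_{\text{data}}$
- Output low probability for fake data: maximize $\log(1 - D(G(z)))$ for $z \sim p(z)$
Combined objective for discriminator:
$$\max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$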
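As a rough illustration of the discriminator objective, the sketch below estimates both expectations by averaging over a handful of discriminator outputs. The function name and the example probabilities are made up for illustration; in practice the outputs would come from a trained network.

```python
import math

def discriminator_objective(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real samples, each in (0, 1)
    d_fake: discriminator outputs on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical outputs of a discriminator that separates real from fake well.
good = discriminator_objective([0.9, 0.95], [0.1, 0.05])
# A discriminator that cannot tell the two apart outputs 0.5 everywhere.
chance = discriminator_objective([0.5, 0.5], [0.5, 0.5])
```

The objective is bounded above by 0, and a discriminator at chance level scores exactly $-\log 4$, the same constant that appears in the analysis of the optimal solution.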
1.1.2. The Generator's Goal
The generator wants to fool the discriminator:
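- Make $D(G(z))$ as close to 1 as possible
- Minimize $\log(1 - D(G(z)))$
Generator objective:
$$\min_G \; \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$
Combining both objectives, we get the famous minimax game:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$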
We can't solve the minimax problem directly, so we alternate:
- Fix G, train D: Given current generator, train discriminator to distinguish real from fake
- Fix D, train G: Given current discriminator, train generator to fool it
- Repeat: This creates an arms race that drives both models to improve
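The alternating scheme can be sketched on a one-dimensional toy problem. Everything here is an assumption chosen for illustration: the data distribution is $N(4, 1)$, the generator is a pure shift $G(z) = z + \theta$, the discriminator is a logistic model $D(x) = \sigma(ax + b)$, and the gradients are written out by hand instead of using an autodiff framework.

```python
import math
import random

def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(steps=500, batch=64, d_steps=5, lr_d=0.05, lr_g=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 0.0, 0.0   # discriminator D(x) = sigmoid(a * x + b)
    theta = 0.0       # generator G(z) = z + theta, with z ~ N(0, 1)
    for _ in range(steps):
        # Fix G, train D: gradient ascent on E[log D(x)] + E[log(1 - D(G(z)))].
        for _ in range(d_steps):
            grad_a = grad_b = 0.0
            for _ in range(batch):
                x_real = rng.gauss(4.0, 1.0)            # hypothetical data distribution
                x_fake = rng.gauss(0.0, 1.0) + theta
                g_real = 1.0 - sigmoid(a * x_real + b)  # d/ds log sigmoid(s)
                g_fake = -sigmoid(a * x_fake + b)       # d/ds log(1 - sigmoid(s))
                grad_a += g_real * x_real + g_fake * x_fake
                grad_b += g_real + g_fake
            a += lr_d * grad_a / batch
            b += lr_d * grad_b / batch
        # Fix D, train G: gradient descent on E[log(1 - D(G(z)))].
        grad_theta = 0.0
        for _ in range(batch):
            x_fake = rng.gauss(0.0, 1.0) + theta
            grad_theta += -sigmoid(a * x_fake + b) * a  # chain rule through x = z + theta
        theta -= lr_g * grad_theta / batch
    return theta, a, b

theta, a, b = train_toy_gan()
```

In this setup the generator's shift `theta` is pushed toward the region the discriminator currently rates as real. The sketch makes no convergence guarantee; step sizes and the number of discriminator updates per generator update are arbitrary choices.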
Question: What happens at the optimal solution?
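At equilibrium, the generator should recover the data distribution: $p_G = p_{\text{data}}$.
Proof sketch:
- For fixed $G$, the optimal discriminator is:
$$D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$$
- Substituting this back into the objective:
$$C(G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}\right] + \mathbb{E}_{x \sim p_G}\left[\log \frac{p_G(x)}{p_{\text{data}}(x) + p_G(x)}\right]$$
- This can be rewritten as:
$$C(G) = -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_G)$$
where JSD is the Jensen-Shannon divergence.
- Since $\mathrm{JSD}(p_{\text{data}} \,\|\, p_G) \ge 0$, with equality if and only if the two distributions coincide, the global minimum $C(G) = -\log 4$ is achieved when $p_G = p_{\text{data}}$.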
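The identity in the proof sketch can be checked numerically on small discrete distributions. The sketch below is only a sanity check under that discrete assumption; the two example distributions are arbitrary.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence with mixture m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    # Plug D*(x) = p_data(x) / (p_data(x) + p_g(x)) into the GAN objective.
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

p_data = [0.5, 0.3, 0.2]   # arbitrary example distributions
p_g = [0.2, 0.2, 0.6]
lhs = value_at_optimal_d(p_data, p_g)
rhs = -math.log(4) + 2 * jsd(p_data, p_g)
at_optimum = value_at_optimal_d(p_data, p_data)
```

Here `lhs` and `rhs` agree up to floating-point error, and `at_optimum` equals $-\log 4$, the value reached when the generator matches the data distribution.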
1.1.3. Evaluation of GAN
- Inception Score (IS)
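$\mathrm{IS} = \exp\left(\mathbb{E}_{x \sim p_G}\left[\mathrm{KL}\left(p(y \mid x) \,\|\, p(y)\right)\right]\right)$, the higher the better. Here $p(y \mid x)$ is the class distribution predicted by a pretrained Inception network for a generated image $x$, and $p(y)$ is its average over generated images.
IS only measures the quality and class diversity of the generated samples, but we also want the distribution $p_G$ to be similar to $p_{\text{data}}$. If the model just memorizes the training data, it will also have a high IS.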
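A minimal sketch of the Inception Score computation, assuming the class probabilities $p(y \mid x)$ are already available. The three hand-written probability tables below stand in for the outputs of a pretrained classifier and are purely illustrative.

```python
import math

def inception_score(probs):
    """probs: list of per-image class distributions p(y|x), each summing to 1."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    mean_kl = 0.0
    for p in probs:
        mean_kl += sum(pc * math.log(pc / marginal[c])
                       for c, pc in enumerate(p) if pc > 0)
    return math.exp(mean_kl / n)

sharp_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3      # classifier is unsure about every image
collapsed = [[1.0, 0.0, 0.0]] * 3         # every image looks like the same class

scores = (inception_score(sharp_and_diverse),
          inception_score(blurry),
          inception_score(collapsed))
```

With three classes the score ranges from 1 to 3: confident predictions spread evenly over the classes reach the maximum, while both uncertain predictions and collapse onto one class give the minimum. Nothing in the computation looks at real data, which is the limitation noted above.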
- Fréchet Inception Distance (FID)
FID measures the distance between real and generated image distributions in Inception feature space.
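For both real images $x_r$ and generated images $x_g$, with Inception features $f_r$ and $f_g$:
Assume features follow multivariate Gaussian distributions:
$$f_r \sim \mathcal{N}(\mu_r, \Sigma_r), \qquad f_g \sim \mathcal{N}(\mu_g, \Sigma_g)$$
Compute sample statistics:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} f_i, \qquad \Sigma = \frac{1}{N-1} \sum_{i=1}^{N} (f_i - \mu)(f_i - \mu)^\top$$
The distance between two multivariate Gaussians:
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$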
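| Component | Meaning |
| --- | --- |
| $\lVert \mu_r - \mu_g \rVert_2^2$ | Difference in central tendency (Are generated images "on average" similar to real ones?) |
| $\operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$ | Difference in variation patterns (Do generated images have similar diversity and correlations?) |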
Lower is better: FID = 0 when $\mu_r = \mu_g$ and $\Sigma_r = \Sigma_g$. FID values are dataset-dependent. Compare only within the same dataset and experimental setup.
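The FID formula can be sketched with NumPy, assuming the feature matrices (one row per image) have already been extracted. The random features below are placeholders, not Inception activations, and the matrix square root is computed through an eigendecomposition of an equivalent symmetric form, which is one of several possible choices.

```python
import numpy as np

def sqrtm_psd(mat):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid_from_features(feat_r, feat_g):
    mu_r, mu_g = feat_r.mean(axis=0), feat_g.mean(axis=0)
    sigma_r = np.cov(feat_r, rowvar=False)   # unbiased sample covariance
    sigma_g = np.cov(feat_g, rowvar=False)
    # Tr((Sigma_r Sigma_g)^{1/2}) equals Tr((Sigma_r^{1/2} Sigma_g Sigma_r^{1/2})^{1/2}),
    # and the second form only needs square roots of symmetric matrices.
    root_r = sqrtm_psd(sigma_r)
    inner = root_r @ sigma_g @ root_r
    cross = sqrtm_psd((inner + inner.T) / 2.0)
    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * np.trace(cross)
    return float(mean_term + cov_term)

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 4))          # placeholder features
fid_same = fid_from_features(features, features)
fid_shifted = fid_from_features(features, features + 2.0)
```

Comparing a feature set with itself gives a value of essentially zero. Shifting every feature by 2 leaves the covariance unchanged, so only the mean term contributes: $4 \times 2^2 = 16$.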
1.1.4. Deep Convolutional GANs (DCGAN)
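- Use a fully convolutional architecture: strided convolutions replace pooling, and fully connected hidden layers are removed
- Use batch normalization to stabilize training
- Use LeakyReLU rather than ReLU in the discriminator to avoid vanishing gradients
- Use a small learning rate and reduced momentum (the DCGAN paper uses Adam with learning rate 0.0002 and $\beta_1 = 0.5$)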
1.1.5. Improved Training Techniques for GAN
- Feature Matching
- Minibatch discrimination
- Historical averaging
- One-sided label smoothing
- Virtual batch normalization
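Of the techniques listed, one-sided label smoothing is the simplest to show in code. The sketch below assumes a binary cross-entropy discriminator loss and a smoothed real-label target of 0.9; both the target value and the function names are illustrative choices.

```python
import math

def bce(pred, target):
    # Binary cross-entropy for one prediction in (0, 1).
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def discriminator_loss(d_real, d_fake, real_label=0.9):
    # One-sided: only the real target is softened; the fake target stays at 0.
    real_loss = sum(bce(p, real_label) for p in d_real) / len(d_real)
    fake_loss = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return real_loss + fake_loss

at_target = bce(0.9, 0.9)
overconfident = bce(0.99, 0.9)
underconfident = bce(0.8, 0.9)
```

With the smoothed target, the loss on a real sample is smallest when the discriminator outputs 0.9, so pushing its output toward 1 is penalized instead of rewarded.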
1.1.6. Wasserstein GANs (WGANs)
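When using the JS divergence, the gradient vanishes when $p_{\text{data}}$ and $p_G$ have fully disjoint supports: the divergence is then constant at $\log 2$ regardless of how far apart the two distributions are, so the generator gets no useful signal, and this can lead to mode collapse. So we use the Wasserstein distance instead, which in its Kantorovich-Rubinstein dual form is
$$W(p_{\text{data}}, p_G) = \sup_{\lVert D \rVert_L \le 1} \; \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] - \mathbb{E}_{z \sim p(z)}[D(G(z))]$$
where the supremum is taken over 1-Lipschitz critics $D$.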
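A small numerical comparison illustrates the difference between the two measures. It assumes the simplest possible disjoint case, two point masses on a unit-spaced grid, and computes the one-dimensional Wasserstein-1 distance from the gap between the cumulative distributions.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """W1 on a shared unit-spaced grid: sum over grid points of |CDF_p - CDF_q|."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total

def point_mass(position, size=10):
    return [1.0 if i == position else 0.0 for i in range(size)]

data = point_mass(0)
results = [(shift, jsd(data, point_mass(shift)), wasserstein_1d(data, point_mass(shift)))
           for shift in (1, 4, 8)]
```

For every shift the JS divergence is stuck at $\log 2$, while the Wasserstein distance equals the shift itself, so it still tells the generator how far it has to move.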
But how do we enforce the Lipschitz constraint on the critic?
- Weight Clipping (original WGAN)
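Clip every critic weight to a fixed range $[-c, c]$ after each update. If each layer of the network has bounded weights, and we use activation functions that are Lipschitz (like ReLU or tanh), then the composition of these functions is Lipschitz as well, with a constant that depends on $c$ and the architecture.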
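Weight clipping itself is a one-line operation applied after every critic update. The sketch below uses nested lists as a stand-in for a weight matrix; the threshold `c=0.01` and the example weights are illustrative assumptions.

```python
def clip_weights(weights, c=0.01):
    """Clamp every entry of a weight matrix (list of rows) to [-c, c]."""
    return [[max(-c, min(c, w)) for w in row] for row in weights]

# Hypothetical critic weights after a gradient step.
updated = [[0.5, -0.002], [-3.0, 0.01]]
clipped = clip_weights(updated)
```

Entries already inside the range are left alone and the rest are moved to the nearest boundary. How tight the resulting Lipschitz bound is depends heavily on `c`, which is one reason the gradient penalty variant was proposed.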
- Gradient Penalty (WGAN-GP)
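$$L_D = \mathbb{E}_{z \sim p(z)}[D(G(z))] - \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] + \lambda \, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$
where $\hat{x} = \epsilon x + (1 - \epsilon)\, G(z)$ with $\epsilon \sim U[0, 1]$.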
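The penalty term can be sketched with NumPy for a toy critic whose input gradient is known in closed form. The critic $f(x) = \tanh(w \cdot x)$, the sample data, and the coefficient `lam=10.0` are all assumptions made for illustration; a real implementation would obtain the gradient from an autodiff framework.

```python
import numpy as np

def critic_input_grad(x, w):
    """Gradient of the toy critic f(x) = tanh(w . x) with respect to each row of x."""
    s = x @ w                                              # shape (n,)
    return (1.0 - np.tanh(s) ** 2)[:, None] * w[None, :]   # shape (n, d)

def gradient_penalty(real, fake, w, rng, lam=10.0):
    # Sample one interpolation coefficient per pair of real and generated points.
    eps = rng.uniform(0.0, 1.0, size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    grad_norm = np.linalg.norm(critic_input_grad(x_hat, w), axis=1)
    # Penalize deviation of the gradient norm from 1 at the interpolated points.
    return float(lam * np.mean((grad_norm - 1.0) ** 2))

rng = np.random.default_rng(0)
real = rng.normal(loc=2.0, size=(8, 3))     # placeholder real samples
fake = rng.normal(loc=-2.0, size=(8, 3))    # placeholder generated samples
penalty = gradient_penalty(real, fake, np.array([0.5, -0.5, 0.5]), rng)
flat_penalty = gradient_penalty(real, fake, np.zeros(3), rng)
```

A critic with zero weights has zero gradient everywhere, so its penalty is exactly `lam`. The penalty is two-sided: gradient norms below 1 are penalized as well as norms above 1.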
1.1.7. BigGAN, GigaGAN, R3GAN, BiGAN, BigBiGAN
- BigGAN: scaling up for quality, class-conditional batch normalization
- GigaGAN: text-to-image at scale
- R3GAN: a minimalist modern baseline built on a regularized relativistic GAN loss
- BiGAN: Bidirectional GAN
- BigBiGAN: Scaling Bidirectional Learning