Implicit Generative Models

1. Use implicit generative models

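We define a prior distribution $p_z(z)$ and a generator $G(z)$ that maps latent codes to samples. The induced model distribution is $p_g(x)=\int p_z(z)\,\delta(x-G(z))\,dz$. If we view $q(z)=p_z(z)$ and $q(x\mid z)=\delta(x-G(z))$, this has the same latent-variable form used in variational inference, except that the conditional is a point mass: we can sample from $p_g(x)$ but cannot evaluate it directly, hence "implicit".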
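As a minimal sketch of what "implicit" means in practice, the snippet below draws samples $x=G(z)$ without ever evaluating a density. The standard normal prior and the affine toy generator `generator` are assumptions chosen for illustration; they do not come from the notes.

```python
import random

def sample_prior(rng):
    """Draw z ~ p_z(z); a standard normal is assumed here."""
    return rng.gauss(0.0, 1.0)

def generator(z):
    """Hypothetical toy generator G(z); any deterministic map would do."""
    return 2.0 * z + 3.0

def sample_model(n, seed=0):
    """Draw n samples x = G(z) with z ~ p_z. No density p_g(x) is evaluated."""
    rng = random.Random(seed)
    return [generator(sample_prior(rng)) for _ in range(n)]

samples = sample_model(10_000)
mean = sum(samples) / len(samples)  # close to 3.0 for this toy generator
```

Sampling is cheap, but nothing in this code can answer "what is $p_g(x)$ at this point?", which is why a GAN needs a separate way to compare $p_g$ with the data.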

1.1. GAN

We need some way to measure how close $p_g(x)$ is to $p_{\text{data}}(x)$. A GAN uses a discriminator to provide that measure.

  • Generator 𝐺: Maps random noise 𝑧 to data-like samples 𝐺(𝑧)
  • Discriminator 𝐷: Classifies samples as real (from data) or fake (from generator)

1.1.1. The Discriminator's Goal

The discriminator wants to:

  • Output high probability for real data: maximize $\log D(x)$ for $x\sim p_{\text{data}}(x)$
  • Output low probability for fake data: maximize $\log(1-D(G(z)))$ for $z\sim p_z(z)$

Combined objective for discriminator:

$$\max_D \; \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))]$$

1.1.2. The Generator's Goal

The generator wants to fool the discriminator:

  • Make $D(G(z))$ as close to 1 as possible
  • Minimize $\log(1-D(G(z)))$

Generator objective:

$$\min_G \; \mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))]$$

Combining both objectives, we get the famous minimax game:

$$\min_G\max_D V(G,D)=\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))]$$

We can't solve the minimax problem directly, so we alternate:

  1. Fix G, train D: Given current generator, train discriminator to distinguish real from fake
  2. Fix D, train G: Given current discriminator, train generator to fool it
  3. Repeat: This creates an arms race that drives both models to improve
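The alternating scheme above can be sketched on a one-dimensional toy problem with hand-written gradients. Everything specific here is an assumption made for illustration: the data distribution $\mathcal{N}(4,1)$, the shift-only generator $G(z)=z+b$, the logistic discriminator $D(x)=\sigma(wx+c)$, and the step sizes. GAN dynamics can oscillate, so this sketch shows the update structure and makes no claim about convergence.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def train_toy_gan(steps=500, batch=64, lr_d=0.05, lr_g=0.05, seed=0):
    rng = random.Random(seed)
    w, c = 0.0, 0.0  # discriminator D(x) = sigmoid(w * x + c)
    b = 0.0          # generator G(z) = z + b
    for _ in range(steps):
        # Step 1: fix G, take one gradient ASCENT step on V w.r.t. (w, c).
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        gw = gc = 0.0
        for x in real:                 # d/dt log sigmoid(t) = 1 - sigmoid(t)
            s = sigmoid(w * x + c)
            gw += (1.0 - s) * x
            gc += 1.0 - s
        for x in fake:                 # d/dt log(1 - sigmoid(t)) = -sigmoid(t)
            s = sigmoid(w * x + c)
            gw -= s * x
            gc -= s
        w += lr_d * gw / batch
        c += lr_d * gc / batch
        # Step 2: fix D, take one gradient DESCENT step on
        # E[log(1 - D(G(z)))] w.r.t. b; the per-sample gradient is -D(G(z)) * w.
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        gb = sum(-sigmoid(w * x + c) * w for x in fake) / batch
        b -= lr_g * gb
    return w, c, b

w, c, b = train_toy_gan()
```

In a real implementation the gradients would come from automatic differentiation and both players would be neural networks; the loop structure (discriminator step, then generator step, repeated) is the part that carries over.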

Question: What happens at the optimal solution?

At equilibrium, the generator should recover the data distribution: $p_g=p_{\text{data}}$.

Proof sketch:

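  1. For fixed $G$, the optimal discriminator is:

    $$D_G^*(x)=\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}$$
  2. Substituting this back into the objective:

    $$C(G)=\max_D V(G,D)=\mathbb{E}_{x\sim p_{\text{data}}}\left[\log\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}\right]+\mathbb{E}_{x\sim p_g}\left[\log\frac{p_g(x)}{p_{\text{data}}(x)+p_g(x)}\right]$$
  3. This can be rewritten as:

    $$C(G)=-\log 4+2\cdot\mathrm{JSD}(p_{\text{data}}\,\Vert\,p_g)$$

    where JSD is the Jensen-Shannon divergence.

  4. Since $\mathrm{JSD}\ge 0$, with equality only when the two distributions coincide, the global minimum $C(G)=-\log 4$ is achieved exactly when $p_g=p_{\text{data}}$.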
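The identity in step 3 is easy to check numerically on discrete distributions. The sketch below plugs the optimal discriminator from step 1 into the objective and compares the result with $-\log 4+2\cdot\mathrm{JSD}$. The two three-point distributions are arbitrary values chosen for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def c_of_g(p_data, p_g):
    """C(G) = V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

p_data = [0.5, 0.3, 0.2]   # illustrative values only
p_g = [0.2, 0.2, 0.6]
lhs = c_of_g(p_data, p_g)
rhs = -math.log(4) + 2 * jsd(p_data, p_g)
at_optimum = c_of_g(p_data, p_data)  # equals -log 4 when p_g = p_data
```

Here `lhs` and `rhs` agree up to floating-point error, and `at_optimum` sits at the lower bound $-\log 4$.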

1.1.3. Evaluation of GAN

  1. Inception Score (IS)

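$$p(y\mid x)=\text{Inception-v3}(x),\quad p(y)=\mathbb{E}_x[p(y\mid x)],\quad \mathrm{IS}=\exp\big(\mathbb{E}_x[\mathrm{KL}(p(y\mid x)\,\Vert\,p(y))]\big)$$

Here $x$ ranges over generated samples. The higher the better.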

IS is computed from generated samples alone: it rewards confident per-image predictions and a varied label marginal, but it never compares against $p_{\text{data}}$. A model that merely memorizes the training data would therefore also get a high IS.
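A minimal sketch of the IS formula, assuming the class probabilities $p(y\mid x)$ have already been computed by some classifier. The function name `inception_score` and the tiny three-class inputs are illustrative; no Inception network is run here.

```python
import math

def inception_score(probs):
    """probs[i][k] = p(y = k | x_i) for generated sample x_i."""
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal label distribution p(y) = E_x[p(y | x)].
    p_y = [sum(row[k] for row in probs) / n for k in range(num_classes)]
    # Average KL(p(y | x) || p(y)) over samples; zero-probability terms are skipped.
    mean_kl = 0.0
    for row in probs:
        mean_kl += sum(p * math.log(p / p_y[k]) for k, p in enumerate(row) if p > 0)
    mean_kl /= n
    return math.exp(mean_kl)

# Confident predictions spread evenly over 3 classes: IS reaches its maximum, 3.
diverse = inception_score([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# Confident predictions that all land on one class: IS drops to 1.
collapsed = inception_score([[1, 0, 0], [1, 0, 0], [1, 0, 0]])
```

Note that neither call needs any real data, which is exactly the limitation described above.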

  2. Fréchet Inception Distance (FID)

FID measures the distance between real and generated image distributions in Inception feature space.

For both real images $X_r=\{x_1^r,x_2^r,\dots,x_N^r\}$ and generated images $X_g=\{x_1^g,x_2^g,\dots,x_M^g\}$:

$$F_r=\text{Inception}(X_r)\in\mathbb{R}^{N\times 2048},\quad F_g=\text{Inception}(X_g)\in\mathbb{R}^{M\times 2048}$$

Assume features follow multivariate Gaussian distributions:

$$F_r\sim\mathcal{N}(\mu_r,\Sigma_r)\ \text{and}\ F_g\sim\mathcal{N}(\mu_g,\Sigma_g)$$

Compute sample statistics:

$$\mu_r=\frac{1}{N}\sum_{i=1}^{N}f_i^r,\quad \Sigma_r=\mathrm{Cov}(F_r),\quad \mu_g=\frac{1}{M}\sum_{j=1}^{M}f_j^g,\quad \Sigma_g=\mathrm{Cov}(F_g)$$

The distance between two multivariate Gaussians:

$$\mathrm{FID}=\lVert\mu_r-\mu_g\rVert^2+\operatorname{Tr}\big(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2}\big)$$

Component | Meaning
--- | ---
$\lVert\mu_r-\mu_g\rVert^2$ | Difference in central tendency (are generated images "on average" similar to real ones?)
$\operatorname{Tr}\big(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2}\big)$ | Difference in variation patterns (do generated images have similar diversity and correlations?)

Lower is better: FID = 0 when $\mu_r=\mu_g$ and $\Sigma_r=\Sigma_g$. FID values are dataset-dependent, so compare them only within the same dataset and experimental setup.
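The FID formula can be sketched with NumPy, assuming the feature matrices are already available. Random 4-dimensional features stand in for the 2048-dimensional Inception features, and the helper name `fid_from_features` is hypothetical. The trace of the matrix square root is computed from the eigenvalues of $\Sigma_r\Sigma_g$, which are real and non-negative for positive semi-definite covariances.

```python
import numpy as np

def fid_from_features(feat_r, feat_g):
    """FID between two feature matrices of shape (num_samples, feature_dim)."""
    mu_r, mu_g = feat_r.mean(axis=0), feat_g.mean(axis=0)
    sigma_r = np.cov(feat_r, rowvar=False)  # columns are feature dimensions
    sigma_g = np.cov(feat_g, rowvar=False)
    # Tr((Sigma_r Sigma_g)^{1/2}) = sum of square roots of the eigenvalues
    # of Sigma_r Sigma_g; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feat = rng.normal(size=(500, 4))            # stand-in "real" features
shifted = feat + np.array([3.0, 0.0, 0.0, 0.0])
fid_same = fid_from_features(feat, feat)     # approximately 0
fid_shift = fid_from_features(feat, shifted) # approximately ||shift||^2 = 9
```

Shifting every feature vector by a constant leaves the covariance unchanged, so only the mean term contributes in the second call.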

1.1.4. Deep Convolutional GANs (DCGAN)

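  • Use an all-convolutional architecture: strided convolutions instead of pooling, and no fully connected hidden layers
  • Use batch normalization to stabilize training
  • Use LeakyReLU instead of ReLU in the discriminator, so that units with negative pre-activations still pass gradient back to the generator
  • Use a small learning rate and reduced momentum (the DCGAN paper uses Adam with learning rate 0.0002 and $\beta_1=0.5$)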
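To make the "strided convolutions instead of pooling" point concrete, here is a small sketch of the spatial-size arithmetic. The kernel size 4, stride 2, padding 1 setting is an assumption (a common choice in DCGAN-style implementations), not something stated in these notes.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def conv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# Generator: each transposed convolution doubles the resolution.
gen_sizes = [4]
for _ in range(4):
    gen_sizes.append(conv_transpose_out(gen_sizes[-1]))

# Discriminator: each strided convolution halves it, with no pooling layers.
disc_sizes = [64]
for _ in range(4):
    disc_sizes.append(conv_out(disc_sizes[-1]))
```

With these settings `gen_sizes` is `[4, 8, 16, 32, 64]` and `disc_sizes` is the same list reversed, so the two networks mirror each other.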

1.1.5. Improved Training Techniques for GAN

  • Feature Matching
  • Minibatch discrimination
  • Historical averaging
  • One-sided label smoothing
  • Virtual batch normalization
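Of the techniques listed, one-sided label smoothing is the simplest to show in code. In the sketch below, the target for real samples is lowered from 1 to a smoothed value while the target for fake samples stays at 0. The value 0.9 and the function names are assumptions for illustration.

```python
import math

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy for one prediction in (0, 1)."""
    return -(target * math.log(pred + eps) + (1 - target) * math.log(1 - pred + eps))

def discriminator_loss(d_real, d_fake, real_target=0.9):
    """Real labels are smoothed to real_target; fake labels stay at 0 (one-sided)."""
    loss_real = sum(bce(p, real_target) for p in d_real) / len(d_real)
    loss_fake = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return loss_real + loss_fake

# With a smoothed target, an over-confident output on real data is penalized:
loss_at_target = bce(0.90, 0.9)
loss_overconfident = bce(0.99, 0.9)
```

The loss on real samples is now minimized at $D(x)=0.9$ rather than 1, which discourages the discriminator from producing extreme outputs.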

1.1.6. Wasserstein GANs (WGANs)

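When $p_{\text{data}}$ and $p_g$ have disjoint supports, the JS divergence is constant (equal to $\log 2$), so it gives the generator no useful gradient. In practice this shows up as stalled or unstable training and is often linked to mode collapse. The Wasserstein distance still varies smoothly with the distance between the two distributions, so WGAN uses it instead:

$$\min_G\max_{D,\ \lVert D\rVert_L\le 1}\ \mathbb{E}_{x\sim p_{\text{data}}}[D(x)]-\mathbb{E}_{x\sim p_g}[D(x)]$$

Here $\lVert D\rVert_L\le 1$ means that $D$ is 1-Lipschitz.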
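A small numeric sketch of why the Wasserstein distance helps: take a point mass at 0 as the data distribution and a point mass at $\theta$ as the model. The grid of support points and the helper names are assumptions for illustration.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(xs, p, q):
    """W1 between discrete distributions on sorted points xs, via the CDF formula."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for i in range(len(xs) - 1):
        cdf_p += p[i]
        cdf_q += q[i]
        total += abs(cdf_p - cdf_q) * (xs[i + 1] - xs[i])
    return total

xs = [0.0, 0.5, 1.0, 2.0]
p_data = [1.0, 0.0, 0.0, 0.0]         # point mass at 0
results = []
for k in (1, 2, 3):
    p_g = [0.0] * len(xs)
    p_g[k] = 1.0                      # point mass at theta = xs[k]
    results.append((xs[k], jsd(p_data, p_g), wasserstein_1d(xs, p_data, p_g)))
```

For every $\theta$ in `results` the JSD stays at $\log 2$, while the Wasserstein distance equals $\theta$, so only the latter tells the generator which way to move.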

But how do we enforce the Lipschitz constraint?

  1. Weight Clipping (original WGAN)

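The original WGAN clips every critic weight to a fixed range $[-c,c]$ after each update. If each layer of the network has bounded weights, and we use activation functions that are Lipschitz (like ReLU or tanh), then the composition of these functions is Lipschitz as well, with some constant $K$ that need not be 1. That is enough, because a $K$-Lipschitz critic only rescales the estimated distance by a constant factor.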
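Weight clipping itself is a one-line operation applied after each critic update. The sketch below uses plain nested lists as stand-in weight matrices. The threshold `c=0.01` follows the value usually quoted for the original WGAN, but treat it as an assumption here.

```python
def clip_weights(weights, c=0.01):
    """Clamp every entry of a weight matrix (list of rows) to [-c, c]."""
    return [[max(-c, min(c, w)) for w in row] for row in weights]

critic_weights = [[0.5, -0.003], [-0.2, 0.01]]   # illustrative values
clipped = clip_weights(critic_weights)
```

After clipping, `clipped` is `[[0.01, -0.003], [-0.01, 0.01]]`: entries already inside the range are left alone, and the rest are moved to the nearest boundary.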

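  2. Gradient Penalty (WGAN-GP)

The critic minimizes the negated WGAN objective plus a penalty on the gradient norm:

$$\mathcal{L}=\mathbb{E}_{x\sim p_g}[D(x)]-\mathbb{E}_{x\sim p_{\text{data}}}[D(x)]+\lambda\cdot\mathbb{E}_{\hat{x}\sim r}\big[(\lVert\nabla_{\hat{x}}D(\hat{x})\rVert_2-1)^2\big]$$

where $r$ is the distribution of $\hat{x}=(1-\varepsilon)x_{\text{data}}+\varepsilon x_g$ with $\varepsilon\sim U[0,1]$, $x_{\text{data}}\sim p_{\text{data}}$ and $x_g\sim p_g$.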
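A minimal sketch of the gradient penalty term, using a toy critic $D(x)=\tanh(w\cdot x)$ whose gradient is known in closed form. The critic, the sample points, and $\lambda=10$ are all assumptions for illustration. A real implementation would obtain $\nabla_{\hat{x}}D(\hat{x})$ from automatic differentiation.

```python
import math
import random

def critic(x, w):
    """Hypothetical toy critic D(x) = tanh(w . x)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def critic_grad(x, w):
    """Analytic gradient of the toy critic with respect to x."""
    t = critic(x, w)
    return [(1.0 - t * t) * wi for wi in w]

def gradient_penalty(real, fake, w, rng):
    """Mean of (||grad D(x_hat)||_2 - 1)^2 over random interpolates x_hat."""
    total = 0.0
    for xr, xf in zip(real, fake):
        eps = rng.random()  # eps ~ U[0, 1)
        x_hat = [(1.0 - eps) * a + eps * b for a, b in zip(xr, xf)]
        norm = math.sqrt(sum(g * g for g in critic_grad(x_hat, w)))
        total += (norm - 1.0) ** 2
    return total / len(real)

def critic_loss(real, fake, w, rng, lam=10.0):
    """E_fake[D] - E_real[D] + lam * gradient penalty (minimized by the critic)."""
    mean_fake = sum(critic(x, w) for x in fake) / len(fake)
    mean_real = sum(critic(x, w) for x in real) / len(real)
    return mean_fake - mean_real + lam * gradient_penalty(real, fake, w, rng)

rng = random.Random(0)
real = [[1.0, 2.0], [0.5, 1.5]]
fake = [[-1.0, 0.0], [0.0, -0.5]]
loss = critic_loss(real, fake, w=[0.3, -0.2], rng=rng)
```

The penalty is two-sided: it pushes the gradient norm toward 1 from both directions, so a critic with zero gradient everywhere is penalized just as a very steep one is.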

1.1.7. BigGAN, GigaGAN, R3GAN, BiGAN, BigBiGAN

  • BigGAN: scaling up for quality, class-conditional batch normalization
  • GigaGAN: text-to-image at scale
  • R3GAN: a minimalist modern GAN baseline built on a regularized relativistic GAN loss
  • BiGAN: Bidirectional GAN
  • BigBiGAN: Scaling Bidirectional Learning