Elucidating the Design Space of Diffusion-Based Generative Models

1. Elucidating the Design Space of Diffusion-Based Generative Models

The paper Elucidating the Design Space of Diffusion-Based Generative Models (EDM) presents a systematic theoretical and empirical analysis of the design choices that define diffusion-based generative models: how the network is parameterized and trained, and how samples are drawn from it.

1.1. Unified Mathematical Framework

1.1.1. The Central Insight: Denoising Score Matching

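The fundamental insight of EDM is that diffusion models can be understood through the lens of denoising score matching. Given a data distribution $p_{\text{data}}(\boldsymbol{x})$, we consider the family of mollified distributions:

$$p(\boldsymbol{x};\sigma) = \int p_{\text{data}}(\boldsymbol{y})\,\mathcal{N}(\boldsymbol{x};\boldsymbol{y},\sigma^2\boldsymbol{I})\,\mathrm{d}\boldsymbol{y}$$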

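This represents the data distribution corrupted by Gaussian noise with standard deviation $\sigma$. The key mathematical insight is that the score function $\nabla_{\boldsymbol{x}} \log p(\boldsymbol{x};\sigma)$ can be learned through denoising:

$$\nabla_{\boldsymbol{x}} \log p(\boldsymbol{x};\sigma) = \frac{D(\boldsymbol{x};\sigma) - \boldsymbol{x}}{\sigma^2}$$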

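where $D(\boldsymbol{x};\sigma)$ is the optimal denoiser that minimizes:

$$\mathbb{E}_{\boldsymbol{y}\sim p_{\text{data}}}\,\mathbb{E}_{\boldsymbol{n}\sim\mathcal{N}(\boldsymbol{0},\sigma^2\boldsymbol{I})}\,\lVert D(\boldsymbol{y}+\boldsymbol{n};\sigma) - \boldsymbol{y}\rVert_2^2$$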
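As a sanity check on the identity between the score and the denoiser, here is a minimal Python sketch for a case where everything is available in closed form: one-dimensional Gaussian data. The toy distribution and the names `mu`, `s`, `optimal_denoiser`, and `score` are assumptions made for illustration; they do not come from the paper.

```python
# Toy check of score = (D(x; sigma) - x) / sigma^2 for 1-D Gaussian data.
# Assumption: data ~ N(mu, s^2), so p(x; sigma) = N(mu, s^2 + sigma^2).

def optimal_denoiser(x, sigma, mu=0.3, s=0.5):
    """Posterior mean E[y | x], which minimizes the L2 denoising loss."""
    return (s**2 * x + sigma**2 * mu) / (s**2 + sigma**2)

def score(x, sigma, mu=0.3, s=0.5):
    """Exact score of the mollified density N(mu, s^2 + sigma^2)."""
    return -(x - mu) / (s**2 + sigma**2)

def score_from_denoiser(x, sigma):
    """Score recovered from the denoiser via the identity."""
    return (optimal_denoiser(x, sigma) - x) / sigma**2

for x, sigma in [(-1.0, 0.1), (0.7, 1.0), (5.0, 20.0)]:
    print(x, sigma, score(x, sigma), score_from_denoiser(x, sigma))
```

The two printed score values should agree at every noise level. For real image data the optimal denoiser has no closed form, which is why a network is trained to approximate it.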

1.1.2. Probability Flow ODE

EDM reformulates the diffusion process as a deterministic probability flow ODE. The evolution of samples is governed by:

$$\frac{\mathrm{d}\boldsymbol{x}}{\mathrm{d}t} = -\dot{\sigma}(t)\,\sigma(t)\,\nabla_{\boldsymbol{x}} \log p(\boldsymbol{x};\sigma(t))$$

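where $\sigma(t)$ is a noise schedule and $\dot{\sigma}(t)$ is its time derivative. Substituting the denoiser expression for the score:

$$\frac{\mathrm{d}\boldsymbol{x}}{\mathrm{d}t} = \frac{\dot{\sigma}(t)}{\sigma(t)}\,\bigl(\boldsymbol{x} - D(\boldsymbol{x};\sigma(t))\bigr)$$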

Intuition: Sampling integrates this ODE backward in time, from high noise to low noise. At each step, the network predicts what the clean image should be, and the sample moves a little way toward that prediction.
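The denoiser form of the ODE can be sketched on a toy problem where the exact solution is known. The example below assumes zero-mean 1-D Gaussian data with standard deviation `S_DATA` and an illustrative schedule sigma(t) = sqrt(t). Both choices, and all function names, are assumptions made for this sketch rather than anything prescribed by the paper.

```python
import math

S_DATA = 0.5  # assumed std of the toy 1-D Gaussian data (zero mean)

def denoiser(x, sigma):
    """Optimal denoiser for zero-mean Gaussian data with std S_DATA."""
    return S_DATA**2 * x / (S_DATA**2 + sigma**2)

def sigma_of_t(t):
    return math.sqrt(t)            # illustrative schedule sigma(t) = sqrt(t)

def sigma_dot(t):
    return 0.5 / math.sqrt(t)      # its time derivative

def dx_dt(x, t):
    """Right-hand side of the probability flow ODE in denoiser form."""
    sig = sigma_of_t(t)
    return sigma_dot(t) / sig * (x - denoiser(x, sig))

def integrate(x, t_start, t_end, n_steps):
    """Plain Euler integration from t_start down to t_end."""
    h = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x += h * dx_dt(x, t)
        t += h
    return x

t_start, t_end = 100.0, 1e-4       # sigma runs from 10 down to 0.01
x_T = 7.0                          # an arbitrary starting point at high noise
x_0 = integrate(x_T, t_start, t_end, 20000)
# Closed-form solution of this toy ODE, for comparison.
exact = x_T * math.sqrt((S_DATA**2 + t_end) / (S_DATA**2 + t_start))
print(x_0, exact)
```

The integrated value should land close to the closed-form one, showing the sample being pulled from a high-noise starting point down to the scale of the data.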

1.1.3. The Choice of $\sigma(t) = t$

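EDM demonstrates that setting $\sigma(t) = t$ leads to particularly well-behaved trajectories. With this choice, $\dot{\sigma}(t) = 1$ and the ODE simplifies to:

$$\frac{\mathrm{d}\boldsymbol{x}}{\mathrm{d}t} = \frac{\boldsymbol{x} - D(\boldsymbol{x};t)}{t}$$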

Key insight: A single Euler step from any point $(\boldsymbol{x}, t)$ to $t = 0$ yields exactly the denoiser output $D(\boldsymbol{x};t)$. The ODE tangent therefore always points toward the denoised image, and because that estimate changes only slowly with noise level, the solution trajectories are nearly linear and can be integrated accurately with few steps.
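The single-step property is easy to confirm numerically. The sketch below uses a stand-in denoiser for toy Gaussian data (an assumption for illustration), but the result holds for any denoiser function, because the step algebraically cancels to `x - (x - D) = D`.

```python
def denoiser(x, t, s_data=0.5):
    """Stand-in denoiser (optimal for zero-mean Gaussian data); any function works."""
    return s_data**2 * x / (s_data**2 + t**2)

def euler_step(x, t, t_next):
    """One Euler step of dx/dt = (x - D(x; t)) / t from t to t_next."""
    d = (x - denoiser(x, t)) / t
    return x + (t_next - t) * d

x, t = 3.0, 2.0
print(euler_step(x, t, 0.0), denoiser(x, t))   # the two values coincide
```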

1.2. Preconditioning: The Heart of EDM

1.2.1. The Problem with Naive Training

Training a network to directly predict 𝐷(𝒙;𝜎) is problematic because:

  • Input magnitude ‖𝒙‖ varies dramatically with noise level 𝜎
  • The required output varies just as much: at low 𝜎 the clean image is almost identical to the input, while at high 𝜎 it must be inferred from nearly pure noise
  • Gradient magnitudes vary wildly across different 𝜎 values

1.2.2. EDM's Preconditioning Solution

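Instead of learning $D$ directly, EDM proposes learning a preconditioned network:

$$D_\theta(\boldsymbol{x};\sigma) = c_{\text{skip}}(\sigma)\,\boldsymbol{x} + c_{\text{out}}(\sigma)\,F_\theta\bigl(c_{\text{in}}(\sigma)\,\boldsymbol{x};\,c_{\text{noise}}(\sigma)\bigr)$$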

where $F_\theta$ is the actual neural network and the $c$ functions are deterministic preconditioning functions of $\sigma$.

1.2.3. Deriving the Preconditioning Functions

EDM derives these functions from first principles:

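Input scaling $c_{\text{in}}(\sigma)$: Normalizes the network input to unit variance

$$c_{\text{in}}(\sigma) = \frac{1}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}}$$

Output scaling $c_{\text{out}}(\sigma)$: Gives the effective training target of $F_\theta$ unit variance

$$c_{\text{out}}(\sigma) = \frac{\sigma \cdot \sigma_{\text{data}}}{\sqrt{\sigma_{\text{data}}^2 + \sigma^2}}$$

Skip connection $c_{\text{skip}}(\sigma)$: Chosen to minimize $c_{\text{out}}(\sigma)$, so that errors of $F_\theta$ are amplified as little as possible

$$c_{\text{skip}}(\sigma) = \frac{\sigma_{\text{data}}^2}{\sigma^2 + \sigma_{\text{data}}^2}$$

Here $\sigma_{\text{data}}$ denotes the standard deviation of the training data.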

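Intuition: At low noise ($\sigma \to 0$), $c_{\text{skip}} \to 1$ and $c_{\text{out}} \to 0$, so the skip connection dominates and the network only needs to remove small perturbations. At high noise ($\sigma \to \infty$), $c_{\text{skip}} \to 0$ and $c_{\text{out}} \to \sigma_{\text{data}}$, so the network output dominates and must recover the signal from almost pure noise.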
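The scaling functions and their limiting behaviour can be sketched in a few lines of Python. The value `SIGMA_DATA = 0.5`, the scalar inputs, and the `raw_network` callable are placeholders assumed for illustration. `c_noise` is simply passed through here because its definition is not given above.

```python
import math

SIGMA_DATA = 0.5  # assumed data standard deviation; set it to match the dataset

def c_in(sigma):
    return 1.0 / math.sqrt(sigma**2 + SIGMA_DATA**2)

def c_out(sigma):
    return sigma * SIGMA_DATA / math.sqrt(sigma**2 + SIGMA_DATA**2)

def c_skip(sigma):
    return SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)

def precondition(raw_network, x, sigma, c_noise):
    """D_theta(x; sigma) = c_skip * x + c_out * F_theta(c_in * x; c_noise)."""
    return c_skip(sigma) * x + c_out(sigma) * raw_network(c_in(sigma) * x, c_noise)

for sigma in (0.002, 0.5, 80.0):
    # Variance of the scaled input c_in * x, given Var[x] = sigma_data^2 + sigma^2.
    input_var = c_in(sigma)**2 * (SIGMA_DATA**2 + sigma**2)
    print(f"sigma={sigma:>6}: c_skip={c_skip(sigma):.4f} "
          f"c_out={c_out(sigma):.4f} scaled input variance={input_var:.4f}")
```

At the smallest noise level the skip weight is close to 1 and the output weight close to 0. At the largest, the skip weight is close to 0 and the output weight close to `SIGMA_DATA`. The scaled input variance stays at 1 throughout.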

1.3. Improvements to the Sampling Process

EDM argues that the sampling process is largely independent of the network's training and can be optimized as a standalone component. The key improvements are:

1.3.1. Deterministic Sampling

  • 2nd-Order ODE Solver: By replacing the standard 1st-order Euler solver with a 2nd-order Heun method, the sampler can take larger, more accurate steps along the solution trajectory.

    Intuition: A 1st-order solver assumes the direction (dx/dt) is constant over a step. A 2nd-order solver looks ahead, corrects for the changing direction, and thus follows the curved path more faithfully. Each Heun step costs two network evaluations instead of one, but far fewer steps are needed, so the total number of function evaluations (NFE) required for high quality drops substantially.

  • Time Step Discretization: The paper shows that concentrating sampling steps in the low-noise regime is critical for perceptual quality. A polynomial schedule with 𝜌=7 is chosen empirically to focus the sampler's "effort" where it matters most.

    Intuition: Errors made at high noise levels (blurry, abstract shapes) are less visually damaging than errors made at low noise levels (fine details, textures). Therefore, we should take careful, small steps when the image is almost finished.
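Both ideas in this list can be sketched together on a toy problem. The schedule formula below (linear in sigma^(1/rho), ending with a final 0) and the values `sigma_min = 0.002` and `sigma_max = 80` are not stated in the text above and are filled in here as assumptions. The Gaussian toy denoiser stands in for a trained network.

```python
import math

S_DATA = 0.5  # assumed std of the zero-mean toy Gaussian data

def denoiser(x, sigma):
    """Optimal denoiser for the toy data; stands in for a trained network."""
    return S_DATA**2 * x / (S_DATA**2 + sigma**2)

def sigma_steps(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Polynomial schedule: linear in sigma**(1/rho), followed by a final 0."""
    a, b = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    steps = [(a + i / (n - 1) * (b - a)) ** rho for i in range(n)]
    return steps + [0.0]

def sample(x, steps, second_order=True):
    """Integrate dx/dt = (x - D(x; t)) / t along `steps` with Euler or Heun."""
    for t, t_next in zip(steps[:-1], steps[1:]):
        d = (x - denoiser(x, t)) / t
        x_next = x + (t_next - t) * d                       # Euler proposal
        if second_order and t_next > 0:
            d_next = (x_next - denoiser(x_next, t_next)) / t_next
            x_next = x + (t_next - t) * 0.5 * (d + d_next)  # Heun correction
        x = x_next
    return x

steps = sigma_steps(18)
x_start = 40.0
# Exact solution of the toy ODE from sigma_max down to 0.
exact = x_start * S_DATA / math.sqrt(S_DATA**2 + steps[0]**2)
euler_err = abs(sample(x_start, steps, second_order=False) - exact)
heun_err = abs(sample(x_start, steps, second_order=True) - exact)
print(f"Euler error: {euler_err:.5f}, Heun error: {heun_err:.5f}")
```

On this toy problem the Heun variant should end closer to the exact solution for the same step grid, at roughly twice the number of denoiser evaluations. Printing `steps` shows how strongly rho = 7 clusters the noise levels near zero.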

1.3.2. Stochastic Sampling

While deterministic sampling is efficient, stochastic sampling can correct errors and often yields better FID scores.

EDM introduces a custom sampler that first adds a controlled amount of noise (a "churn" step) and then takes a 2nd-order deterministic step to denoise. Because too much added noise degrades image detail, the amount of stochasticity is tuned heuristically, for example by restricting it to a specific noise range $[S_{\text{tmin}}, S_{\text{tmax}}]$.
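A minimal sketch of what such a churn step might look like for a scalar sample follows. Only the noise range is named in the text above. The other parameters (`s_churn`, `s_noise`), the cap on `gamma`, and all numeric defaults are assumptions chosen for illustration and should be checked against the paper before use.

```python
import math
import random

def churn(x, t, n_steps, rng, s_churn=40.0, s_tmin=0.05, s_tmax=50.0, s_noise=1.003):
    """Temporarily raise the noise level of sample x from t to t_hat >= t."""
    in_range = s_tmin <= t <= s_tmax
    gamma = min(s_churn / n_steps, math.sqrt(2.0) - 1.0) if in_range else 0.0
    t_hat = t * (1.0 + gamma)
    # Add just enough fresh noise to move from noise level t to t_hat.
    eps = rng.gauss(0.0, s_noise)
    x_hat = x + math.sqrt(t_hat**2 - t**2) * eps
    return x_hat, t_hat

rng = random.Random(0)
print(churn(1.0, 10.0, n_steps=32, rng=rng))   # inside the range: noise is added
print(churn(1.0, 0.01, n_steps=32, rng=rng))   # outside the range: unchanged
```

After churning, the sampler would take its 2nd-order step from `t_hat` (not `t`) to the next noise level, so the extra noise is removed again by the denoiser.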

1.4. Training Objective and Loss Weighting

1.4.1. The Effective Training Target

With preconditioning, the effective training target for $F_\theta$ becomes:

$$\text{Target} = \frac{1}{c_{\text{out}}(\sigma)}\Bigl(\boldsymbol{y} - c_{\text{skip}}(\sigma)\,(\boldsymbol{y} + \boldsymbol{n})\Bigr)$$

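With the scalings above, this target has unit variance at every noise level. At low noise it is approximately the negated, normalized noise $-\boldsymbol{n}/\sigma$, and at high noise it approaches the normalized clean image $\boldsymbol{y}/\sigma_{\text{data}}$, so the network predicts a mixture of noise and signal rather than the absolute denoised image.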

1.4.2. Optimal Loss Weighting

EDM derives the loss weighting that equalizes training emphasis across noise levels:

$$\lambda(\sigma) = \frac{\sigma^2 + \sigma_{\text{data}}^2}{(\sigma \cdot \sigma_{\text{data}})^2}$$

Mathematical insight: Since $\lambda(\sigma) = 1/c_{\text{out}}(\sigma)^2$, the effective weight on the loss of $F_\theta$ is $\lambda(\sigma)\,c_{\text{out}}(\sigma)^2 = 1$ at every noise level, which prevents training from being dominated by any particular $\sigma$ range.
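The relationship between the weighted denoiser loss and the effective target can be checked numerically. The sketch below works on scalars and assumes `SIGMA_DATA = 0.5`. The function names and the arbitrary test values are illustrative only.

```python
import math

SIGMA_DATA = 0.5  # assumed data standard deviation

def c_out(sigma):
    return sigma * SIGMA_DATA / math.sqrt(sigma**2 + SIGMA_DATA**2)

def c_skip(sigma):
    return SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)

def loss_weight(sigma):
    return (sigma**2 + SIGMA_DATA**2) / (sigma * SIGMA_DATA) ** 2

def effective_target(y, n, sigma):
    return (y - c_skip(sigma) * (y + n)) / c_out(sigma)

def denoiser_loss(f_out, y, n, sigma):
    """lambda(sigma) * (D_theta(y + n; sigma) - y)^2 for a scalar sample."""
    d = c_skip(sigma) * (y + n) + c_out(sigma) * f_out
    return loss_weight(sigma) * (d - y) ** 2

def network_loss(f_out, y, n, sigma):
    """The same loss written directly in terms of the raw network output."""
    return (f_out - effective_target(y, n, sigma)) ** 2

y, f_out = 0.8, 0.3           # arbitrary clean value and raw network output
for sigma in (0.01, 0.5, 50.0):
    n = 1.7 * sigma           # an arbitrary noise draw scaled by sigma
    print(sigma, denoiser_loss(f_out, y, n, sigma), network_loss(f_out, y, n, sigma))
```

The two losses should agree at every noise level, which is another way of saying that the weighting cancels the factor `c_out(sigma)**2` exactly.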

1.4.3. Log-Normal Noise Distribution

Instead of uniform sampling of $\sigma$, EDM uses a log-normal distribution:

$$\ln \sigma \sim \mathcal{N}(P_{\text{mean}}, P_{\text{std}}^2)$$

Rationale: The training loss can be reduced meaningfully only at intermediate noise levels. At very low noise, the input is already nearly clean, so the remaining noise is hard to discern and matters little. At very high noise, the best achievable prediction approaches the dataset mean, so there is little structure to capture.
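Drawing training noise levels from this distribution takes one line. In the sketch below, the values `P_MEAN = -1.2` and `P_STD = 1.2` are assumed for illustration, since the text above does not specify them. The range `[0.01, 10]` is likewise just a convenient window for inspecting the samples.

```python
import math
import random

P_MEAN, P_STD = -1.2, 1.2   # assumed values for illustration

def sample_sigma(rng):
    """Draw a training noise level with ln(sigma) ~ N(P_MEAN, P_STD^2)."""
    return math.exp(rng.gauss(P_MEAN, P_STD))

rng = random.Random(0)
sigmas = sorted(sample_sigma(rng) for _ in range(20000))
median = sigmas[len(sigmas) // 2]
inside = sum(0.01 <= s <= 10.0 for s in sigmas) / len(sigmas)
print(f"median sigma ~ {median:.3f}, fraction in [0.01, 10]: {inside:.3f}")
```

With these assumed parameters, most draws fall at intermediate noise levels, and very small or very large values of sigma are sampled only rarely.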

1.4.4. Non-leaky Augmentation

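To combat overfitting, especially on smaller datasets, EDM employs a conditional augmentation pipeline. Geometric transformations are applied to training images, and the transformation parameters are fed to the network as a conditioning input. At sampling time this input is set to zero (no augmentation), so the augmentations do not leak into the generated images.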
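The idea can be sketched with a deliberately tiny example that uses only horizontal and vertical flips on a nested-list "image". The function name `augment`, the two-element conditioning vector, and the flip probability are hypothetical. The actual pipeline covers a broader set of geometric transformations.

```python
import random

def augment(image, rng, p=0.5):
    """Randomly flip a 2-D image (list of rows) and report what was done.

    Returns the augmented image and a conditioning vector [x_flip, y_flip]
    that would be passed to the network alongside the image.
    """
    x_flip = int(rng.random() < p)
    y_flip = int(rng.random() < p)
    out = [row[::-1] for row in image] if x_flip else [row[:] for row in image]
    if y_flip:
        out = out[::-1]
    return out, [x_flip, y_flip]

NO_AUGMENTATION = [0, 0]   # conditioning vector used at sampling time

rng = random.Random(0)
image = [[1, 2], [3, 4]]
aug_image, cond = augment(image, rng)
print(aug_image, cond)
```

During training the network sees `(aug_image, cond)` pairs. At sampling time it is always given `NO_AUGMENTATION`, so it is only ever asked for images in their original orientation.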

References

  1. Elucidating the Design Space of Diffusion-Based Generative Models