Elucidating the Design Space of Diffusion-Based Generative Models

June 27, 2025

by Leonardo

1. Elucidating the Design Space of Diffusion-Based Generative Models

The paper "Elucidating the Design Space of Diffusion-Based Generative Models" (EDM) offers a systematic theoretical and empirical analysis of the design choices behind diffusion models: how they are formulated, how they are sampled, and how they are trained.

1.1. Unified Mathematical Framework

1.1.1. The Central Insight: Denoising Score Matching

The fundamental insight of EDM is that diffusion models can be understood through the lens of denoising score matching. Given a data distribution $𝑝_{data} (𝒙)$ , we consider the family of mollified distributions:

𝑝 (𝒙; 𝜎) = \int 𝑝_{data} (𝒚) 𝒩 (𝒙; 𝒚, 𝜎^{2} 𝑰) 𝑑 𝒚

This represents the data distribution corrupted by Gaussian noise with standard deviation $𝜎$ . The key mathematical insight is that the score function $\nabla_{𝒙} \log 𝑝 (𝒙; 𝜎)$ can be learned through denoising:

\nabla_{𝒙} \log 𝑝 (𝒙; 𝜎) = \frac{𝐷 (𝒙; 𝜎) - 𝒙}{𝜎^{2}}

where $𝐷 (𝒙; 𝜎)$ is the optimal denoiser that minimizes:

𝔼_{𝒚 \sim 𝑝_{data}} 𝔼_{𝒏 \sim 𝒩 (𝟎, 𝜎^{2} 𝑰)} ‖ 𝐷 (𝒚 + 𝒏; 𝜎) - 𝒚 ‖_{2}^{2}
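For a one-dimensional Gaussian data distribution the optimal denoiser has a closed form, which makes the score identity above easy to check numerically. The sketch below assumes $𝑝_{data} = 𝒩 (\mu, s^{2})$; the function names and the values of `mu` and `s` are illustrative choices, not taken from the paper.

```python
import math

def optimal_denoiser(x, sigma, mu=1.0, s=0.5):
    """Posterior mean E[y | x] for y ~ N(mu, s^2) and x = y + n, n ~ N(0, sigma^2)."""
    return (s**2 * x + sigma**2 * mu) / (s**2 + sigma**2)

def score_exact(x, sigma, mu=1.0, s=0.5):
    """Score of p(x; sigma) = N(mu, s^2 + sigma^2), computed directly."""
    return -(x - mu) / (s**2 + sigma**2)

def score_from_denoiser(x, sigma, mu=1.0, s=0.5):
    """Score recovered through the identity (D(x; sigma) - x) / sigma^2."""
    return (optimal_denoiser(x, sigma, mu, s) - x) / sigma**2

# The two routes to the score should agree at every point and noise level.
for x, sigma in [(-2.0, 0.1), (0.3, 1.0), (5.0, 10.0)]:
    assert math.isclose(score_exact(x, sigma), score_from_denoiser(x, sigma), rel_tol=1e-9)
```

In a real model the closed-form denoiser is replaced by a trained network, and the identity holds only approximately.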

1.1.2. Probability Flow ODE

EDM reformulates the diffusion process as a deterministic probability flow ODE. The evolution of samples is governed by:

\frac{𝑑 𝒙}{𝑑 𝑡} = - \dot{𝜎} (𝑡) 𝜎 (𝑡) \nabla_{𝒙} \log 𝑝 (𝒙; 𝜎 (𝑡))

where $𝜎 (𝑡)$ is a noise schedule and $\dot{𝜎} (𝑡)$ is its time derivative. Substituting the denoising connection:

\frac{𝑑 𝒙}{𝑑 𝑡} = \frac{\dot{𝜎} (𝑡)}{𝜎 (𝑡)} (𝒙 - 𝐷 (𝒙; 𝜎 (𝑡)))

Intuition: When the ODE is integrated backward in time during sampling (decreasing noise level), it continuously moves samples toward the denoised estimate. At each time step, the network predicts what the clean image should be, and the ODE moves the current sample in that direction.

1.1.3. The Choice of $𝜎 (𝑡) = 𝑡$

EDM demonstrates that setting $𝜎 (𝑡) = 𝑡$ leads to particularly well-behaved trajectories. With this choice, the ODE simplifies to:

\frac{𝑑 𝒙}{𝑑 𝑡} = \frac{𝒙 - 𝐷 (𝒙; 𝑡)}{𝑡}

Key insight: A single Euler step from any point $(𝒙, 𝑡)$ to $𝑡 = 0$ yields exactly the denoiser output $𝐷 (𝒙; 𝑡)$ . This means the ODE tangent always points toward the denoised image, creating nearly linear solution trajectories that are numerically stable.
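The single-step property can be verified directly. This is a minimal sketch that reuses a toy closed-form denoiser for Gaussian data as a stand-in for a trained network; `denoiser`, `ode_slope`, and `euler_step` are hypothetical helper names.

```python
import math

def denoiser(x, t, mu=1.0, s=0.5):
    """Toy optimal denoiser for data ~ N(mu, s^2); stands in for a trained network."""
    return (s**2 * x + t**2 * mu) / (s**2 + t**2)

def ode_slope(x, t):
    """Right-hand side of dx/dt = (x - D(x; t)) / t, valid for sigma(t) = t."""
    return (x - denoiser(x, t)) / t

def euler_step(x, t, t_next):
    """One explicit Euler step from time t to time t_next."""
    return x + (t_next - t) * ode_slope(x, t)

x, t = 3.7, 2.0
jump = euler_step(x, t, 0.0)  # step all the way to t = 0
assert math.isclose(jump, denoiser(x, t), rel_tol=1e-9)
```

The step lands on the denoiser output for any starting point, because the factor `t` in the step size cancels the `1 / t` in the slope.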

1.2. Preconditioning: The Heart of EDM

1.2.1. The Problem with Naive Training

Training a network to directly predict $𝐷 (𝒙; 𝜎)$ is problematic because:

Input magnitude $‖ 𝒙 ‖$ varies dramatically with noise level $𝜎$
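The relationship between input and target changes with noise level: at low $𝜎$ the clean target is almost identical to the input, while at high $𝜎$ the input carries little information and the best prediction approaches the dataset mean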
Gradient magnitudes vary wildly across different $𝜎$ values

1.2.2. EDM's Preconditioning Solution

Instead of learning $𝐷$ directly, EDM proposes learning a preconditioned network:

𝐷_{𝜃} (𝒙; 𝜎) = 𝑐_{skip} (𝜎) 𝒙 + 𝑐_{out} (𝜎) 𝐹_{𝜃} (𝑐_{in} (𝜎) 𝒙; 𝑐_{noise} (𝜎))

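where $𝐹_{𝜃}$ is the raw neural network and the $𝑐$ functions are fixed, non-learned functions of the noise level: $𝑐_{skip}$ weights the skip connection, $𝑐_{in}$ and $𝑐_{out}$ scale the network's input and output, and $𝑐_{noise}$ maps $𝜎$ to the network's noise-conditioning input (EDM uses $𝑐_{noise} (𝜎) = \frac{1}{4} \ln 𝜎$ ).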

1.2.3. Deriving the Preconditioning Functions

EDM derives these functions from first principles:

Input scaling $𝑐_{in} (𝜎)$ : Normalizes input to unit variance

𝑐_{in} (𝜎) = \frac{1}{\sqrt{𝜎^{2} + 𝜎_{data}^{2}}}

Output scaling $𝑐_{out} (𝜎)$ : Scales the network output so that the effective training target has unit variance

𝑐_{out} (𝜎) = \frac{𝜎 \cdot 𝜎_{data}}{\sqrt{𝜎_{data}^{2} + 𝜎^{2}}}

Skip connection $𝑐_{skip} (𝜎)$ : Chosen to minimize $𝑐_{out} (𝜎)$ , so that errors made by $𝐹_{𝜃}$ are amplified as little as possible

𝑐_{skip} (𝜎) = \frac{𝜎_{data}^{2}}{𝜎^{2} + 𝜎_{data}^{2}}

Intuition: At low noise ( $𝜎 \to 0$ ), the skip connection dominates and the network only needs to remove small perturbations. At high noise ( $𝜎 \to \infty$ ), the network output dominates and learns to extract signal from pure noise.
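The preconditioning functions translate directly into code. The sketch below is a scalar illustration, assuming $𝜎_{data} = 0.5$ and $𝑐_{noise} (𝜎) = \frac{1}{4} \ln 𝜎$; both are assumptions here, not values stated above. `raw_net` is a hypothetical stand-in for $𝐹_{𝜃}$.

```python
import math

SIGMA_DATA = 0.5  # assumed standard deviation of the data

def c_in(sigma, sd=SIGMA_DATA):
    return 1.0 / math.sqrt(sigma**2 + sd**2)

def c_out(sigma, sd=SIGMA_DATA):
    return sigma * sd / math.sqrt(sigma**2 + sd**2)

def c_skip(sigma, sd=SIGMA_DATA):
    return sd**2 / (sigma**2 + sd**2)

def precondition(raw_net, x, sigma):
    """D_theta(x; sigma) = c_skip * x + c_out * F_theta(c_in * x; c_noise)."""
    c_noise = 0.25 * math.log(sigma)  # assumed form of the noise conditioning
    return c_skip(sigma) * x + c_out(sigma) * raw_net(c_in(sigma) * x, c_noise)

def target_variance(sigma, sd=SIGMA_DATA):
    """Variance of (y - c_skip * (y + n)) / c_out for independent y and n."""
    signal = (1.0 - c_skip(sigma)) ** 2 * sd**2
    noise = c_skip(sigma) ** 2 * sigma**2
    return (signal + noise) / c_out(sigma) ** 2

# Network input and training target both have unit variance at every noise level.
for sigma in (0.002, 0.5, 80.0):
    assert math.isclose(c_in(sigma) ** 2 * (sigma**2 + SIGMA_DATA**2), 1.0, rel_tol=1e-9)
    assert math.isclose(target_variance(sigma), 1.0, rel_tol=1e-6)
```

With a network that outputs zero, `precondition` reduces to `c_skip(sigma) * x`, which shows how much of the input passes through unchanged at each noise level.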

1.3. Improvements to the Sampling Process

EDM argues that the sampling process is largely independent of the network's training and can be optimized as a standalone component. The key improvements are:

1.3.1. Deterministic Sampling

2nd-Order ODE Solver: By replacing the standard 1st-order Euler solver with a 2nd-order Heun method, the sampler can take larger, more accurate steps along the solution trajectory.

Intuition: A 1st-order solver assumes the direction (dx/dt) is constant over a step. A 2nd-order solver looks ahead, corrects for the changing direction, and thus follows the curved path more faithfully. Each Heun step costs two network evaluations instead of one, but the larger steps it allows reduce the total number of function evaluations (NFE) needed for high quality.

Time Step Discretization: The paper shows that concentrating sampling steps in the low-noise regime is critical for perceptual quality. A polynomial schedule with $𝜌 = 7$ is chosen empirically to focus the sampler's "effort" where it matters most.

Intuition: Errors made at high noise levels (blurry, abstract shapes) are less visually damaging than errors made at low noise levels (fine details, textures). Therefore, we should take careful, small steps when the image is almost finished.
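Both ideas can be tried on a toy problem. The sketch below assumes the polynomial schedule has the form $𝜎_{i} = (𝜎_{max}^{1/𝜌} + \frac{i}{N-1} (𝜎_{min}^{1/𝜌} - 𝜎_{max}^{1/𝜌}))^{𝜌}$ with $𝜎_{min} = 0.002$ and $𝜎_{max} = 80$; these specifics are assumptions, not given in the text above. The denoiser is the closed-form one for Gaussian data, for which the ODE has an exact solution to compare against.

```python
import math

def edm_sigma_steps(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Decreasing noise levels sigma_0 > ... > sigma_{n-1}, followed by a final 0."""
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    steps = [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]
    return steps + [0.0]

def denoiser(x, t, mu=1.0, s=0.5):
    """Toy optimal denoiser for data ~ N(mu, s^2); stands in for a trained network."""
    return (s**2 * x + t**2 * mu) / (s**2 + t**2)

def slope(x, t):
    """dx/dt = (x - D(x; t)) / t for sigma(t) = t."""
    return (x - denoiser(x, t)) / t

def sample(x, steps, second_order=True):
    """Integrate the ODE from steps[0] down to 0 with Euler or Heun steps."""
    for t, t_next in zip(steps[:-1], steps[1:]):
        d = slope(x, t)
        x_next = x + (t_next - t) * d              # Euler proposal
        if second_order and t_next > 0:            # Heun correction (skipped at t = 0)
            d_next = slope(x_next, t_next)
            x_next = x + (t_next - t) * 0.5 * (d + d_next)
        x = x_next
    return x

def exact(x_start, t_start, mu=1.0, s=0.5):
    """Closed-form ODE solution at t = 0 for the Gaussian toy problem."""
    return mu + (x_start - mu) * s / math.sqrt(s**2 + t_start**2)

steps = edm_sigma_steps(40)
x0 = 121.0  # fixed starting point at the highest noise level, for determinism
err_euler = abs(sample(x0, steps, second_order=False) - exact(x0, steps[0]))
err_heun = abs(sample(x0, steps, second_order=True) - exact(x0, steps[0]))
print(f"Euler error: {err_euler:.4f}, Heun error: {err_heun:.4f}")
```

On this toy problem Heun's error is smaller than Euler's for the same step sequence, at roughly twice the number of denoiser evaluations. How this trade-off plays out for a real network depends on the model and dataset.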

1.3.2. Stochastic Sampling

While deterministic sampling is efficient, stochastic sampling can correct errors and often yields better FID scores.

EDM introduces a custom sampler that first adds a controlled amount of noise (a "churn" step) and then takes a 2nd-order deterministic step to denoise. This process is carefully controlled with heuristics to prevent image degradation, such as limiting stochasticity to a specific noise range $[𝑆_{tmin}, 𝑆_{tmax}]$ .
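The churn step alone can be sketched as follows. This is an illustrative reconstruction, not the reference implementation: the rule for the churn factor `gamma`, the noise inflation factor `s_noise`, and all default values are assumptions chosen for the example.

```python
import math
import random

def churn(x, t, n_steps, rng, s_churn=40.0, s_tmin=0.05, s_tmax=50.0, s_noise=1.003):
    """Temporarily raise the noise level from t to t_hat by adding fresh noise.

    x is a list of floats; returns (x_hat, t_hat). Outside [s_tmin, s_tmax]
    no noise is added and the sample is returned unchanged.
    """
    in_range = s_tmin <= t <= s_tmax
    gamma = min(s_churn / n_steps, math.sqrt(2.0) - 1.0) if in_range else 0.0
    t_hat = t + gamma * t
    # Adding noise of this standard deviation moves the sample from level t to t_hat.
    extra_std = math.sqrt(t_hat**2 - t**2)
    x_hat = [xi + extra_std * s_noise * rng.gauss(0.0, 1.0) for xi in x]
    return x_hat, t_hat

rng = random.Random(0)
x_hat, t_hat = churn([0.0, 1.0], 1.0, 20, rng)          # inside the range: noise added
x_same, t_same = churn([0.0, 1.0], 100.0, 20, rng)      # outside the range: untouched
```

After the churn, a 2nd-order deterministic step would be taken from `t_hat` to the next noise level, starting at `x_hat`.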

1.4. Training Objective and Loss Weighting

1.4.1. The Effective Training Target

With preconditioning, the effective training target for $𝐹_{𝜃}$ becomes:

Target = \frac{1}{𝑐_{out} (𝜎)} (𝒚 - 𝑐_{skip} (𝜎) (𝒚 + 𝒏))

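This target is a noise-level-dependent mix of signal and noise rather than the absolute denoised image: at low $𝜎$ the network effectively predicts the negated, normalized noise, while at high $𝜎$ it predicts the scaled clean image.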
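To make the structure of the target explicit, substitute $𝑐_{skip}$ and $𝑐_{out}$ from Section 1.2.3. The numerator becomes:

𝒚 - 𝑐_{skip} (𝜎) (𝒚 + 𝒏) = \frac{𝜎^{2} 𝒚 - 𝜎_{data}^{2} 𝒏}{𝜎^{2} + 𝜎_{data}^{2}}

Dividing by $𝑐_{out} (𝜎)$ gives:

Target = \frac{𝜎}{𝜎_{data} \sqrt{𝜎^{2} + 𝜎_{data}^{2}}} 𝒚 - \frac{𝜎_{data}}{𝜎 \sqrt{𝜎^{2} + 𝜎_{data}^{2}}} 𝒏

Reading off the coefficients, the $𝒚$ term vanishes as $𝜎 \to 0$ and the $𝒏$ term vanishes as $𝜎 \to \infty$ . Assuming $𝒚$ has per-component variance $𝜎_{data}^{2}$ and is independent of $𝒏$ , the variance of the target is:

\frac{𝜎^{2}}{𝜎^{2} + 𝜎_{data}^{2}} + \frac{𝜎_{data}^{2}}{𝜎^{2} + 𝜎_{data}^{2}} = 1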

1.4.2. Optimal Loss Weighting

EDM derives the loss weighting that equalizes training emphasis across noise levels:

𝜆 (𝜎) = \frac{𝜎^{2} + 𝜎_{data}^{2}}{{(𝜎 \cdot 𝜎_{data})}^{2}}

Mathematical insight: Since $𝜆 (𝜎) = 1 / 𝑐_{out} (𝜎)^{2}$ , this weighting exactly cancels the output scaling, so the effective weight on the error of $𝐹_{𝜃}$ is 1 at every noise level. No particular $𝜎$ range dominates training purely because of scale.
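The cancellation can be checked numerically: the weighted loss on the denoiser output equals the unweighted loss on the raw network output. A minimal scalar sketch, assuming $𝜎_{data} = 0.5$; `losses` and `f_out` (a stand-in for the output of $𝐹_{𝜃}$) are hypothetical names.

```python
import math

SIGMA_DATA = 0.5  # assumed standard deviation of the data

def c_out(sigma, sd=SIGMA_DATA):
    return sigma * sd / math.sqrt(sigma**2 + sd**2)

def c_skip(sigma, sd=SIGMA_DATA):
    return sd**2 / (sigma**2 + sd**2)

def loss_weight(sigma, sd=SIGMA_DATA):
    """lambda(sigma) = (sigma^2 + sd^2) / (sigma * sd)^2 = 1 / c_out(sigma)^2."""
    return (sigma**2 + sd**2) / (sigma * sd) ** 2

def losses(y, n, f_out, sigma):
    """Return (weighted denoiser loss, raw-network loss) for scalar inputs."""
    x = y + n
    d = c_skip(sigma) * x + c_out(sigma) * f_out       # preconditioned denoiser output
    target = (y - c_skip(sigma) * x) / c_out(sigma)    # effective target for F_theta
    return loss_weight(sigma) * (d - y) ** 2, (f_out - target) ** 2

for sigma in (0.002, 0.5, 80.0):
    weighted, raw = losses(y=0.3, n=0.7 * sigma, f_out=0.7, sigma=sigma)
    assert math.isclose(weighted, raw, rel_tol=1e-6)
```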

1.4.3. Log-Normal Noise Distribution

Instead of uniform sampling of $𝜎$ , EDM uses a log-normal distribution:

\log 𝜎 \sim 𝒩 (𝑃_{mean}, 𝑃_{std}^{2})

Rationale: The training loss can be reduced meaningfully only at intermediate noise levels. At very low noise, the noise component is vanishingly small, so it is both hard to discern and nearly irrelevant to remove. At very high noise, the best achievable prediction approaches the dataset mean, so there is little structure to capture.
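Drawing training noise levels from this distribution takes one line. In the sketch below, $𝑃_{mean} = -1.2$ and $𝑃_{std} = 1.2$ are assumed values used for illustration, and the logarithm is taken to be the natural log.

```python
import math
import random

def sample_sigma(rng, p_mean=-1.2, p_std=1.2):
    """Draw one training noise level with ln(sigma) ~ N(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))

rng = random.Random(0)
sigmas = sorted(sample_sigma(rng) for _ in range(20000))
median = sigmas[len(sigmas) // 2]
print(f"median sigma: {median:.3f} (exp(p_mean) = {math.exp(-1.2):.3f})")
```

The median of the sampled noise levels sits near `exp(p_mean)`, so most training effort lands in the intermediate range while very small and very large noise levels are visited rarely.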

1.4.4. Non-leaky Augmentation

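To combat overfitting, especially on smaller datasets, EDM employs a conditional augmentation pipeline. Geometric transformations are applied to training images, and the transformation parameters are fed to the network as a conditioning input. At inference the conditioning is set to zero (no augmentation), so the augmentations do not leak into the generated images.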
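A minimal sketch of the mechanism, using a horizontal flip as the only augmentation. The function name, the one-element label vector, and the list-of-rows image format are illustrative assumptions; a real pipeline would use a richer set of geometric transformations.

```python
import random

def augment(image, rng):
    """Randomly flip a 2D image (list of rows) horizontally.

    Returns the augmented image and a label vector describing what was done,
    which would be passed to the network as an extra conditioning input.
    """
    flipped = rng.random() < 0.5
    out = [row[::-1] for row in image] if flipped else [row[:] for row in image]
    return out, [1.0 if flipped else 0.0]

# At sampling time the conditioning is fixed to "no augmentation".
NO_AUGMENTATION = [0.0]

image = [[1, 2, 3], [4, 5, 6]]
augmented, labels = augment(image, random.Random(0))
```

During training the network sees `(augmented, labels)` pairs; during sampling it always receives `NO_AUGMENTATION`, so it is only ever asked for unaugmented images.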
