Score-Based Models
1. Use score-based models
Score-Matching Langevin Dynamics (SMLD)
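Langevin equation:
$$x_{t+1} = x_t + \tau \nabla_x \log p(x_t) + \sqrt{2\tau}\, z, \qquad z \sim \mathcal{N}(0, I),$$
where $\tau$ is the step size.
Stein's score function: $s_\theta(x) = \nabla_x \log p_\theta(x)$. Do not confuse this with the ordinary score function $\nabla_\theta \log p_\theta(x)$, which differentiates with respect to the parameters rather than the data. For example, if $p(x)$ is a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, then $\nabla_x \log p(x) = -(x - \mu)/\sigma^2$. Because the gradient with respect to $x$ does not depend on the normalizing constant, we don't need any special architectures to make the normalizing constant tractable.
We minimize the Fisher divergence between the model and the data distribution:
$$\frac{1}{2}\,\mathbb{E}_{p(x)}\left[\left\|\nabla_x \log p(x) - s_\theta(x)\right\|_2^2\right].$$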
The key challenge is the fact that the estimated score functions are inaccurate in low density regions, where few data points are available for computing the score matching objective.
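As a minimal sketch of the Langevin update, the following runs many independent 1-D chains against the exact score of a Gaussian. The target $\mathcal{N}(2, 0.5^2)$, the step size, and the helper names are assumptions chosen for illustration; in practice the exact score would be replaced by a learned $s_\theta$.

```python
import math
import random


def gaussian_score(x, mu, sigma):
    """Stein score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2


def langevin_sample(score, x0, tau, n_steps, rng):
    """Iterate x <- x + tau * score(x) + sqrt(2 * tau) * z with z ~ N(0, 1)."""
    x = x0
    for _ in range(n_steps):
        x = x + tau * score(x) + math.sqrt(2.0 * tau) * rng.gauss(0.0, 1.0)
    return x


def run_chains(n_chains=1000, mu=2.0, sigma=0.5, tau=0.01, n_steps=300, seed=0):
    """Run independent chains from x0 = 0 and return their final states."""
    rng = random.Random(seed)

    def score(x):
        return gaussian_score(x, mu, sigma)

    return [langevin_sample(score, 0.0, tau, n_steps, rng) for _ in range(n_chains)]
```

With a small step size the final states should be distributed approximately as the target, up to a discretization bias that shrinks with $\tau$.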
1.0.1. Explicit Score Matching
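Consider the classical kernel density estimation by defining a distribution
$$q(x) = \frac{1}{M}\sum_{m=1}^{M} \frac{1}{h} K\!\left(\frac{x - x^{(m)}}{h}\right),$$
where $h$ is some hyperparameter for the kernel function $K(\cdot)$ and $x^{(m)}$ is the $m$-th sample in the training set.
This is a smooth approximation of the true distribution $p(x)$, which is never known, so we can learn $s_\theta(x)$ based on $q(x)$.
The explicit score matching loss is
$$J_{\mathrm{ESM}}(\theta) = \frac{1}{2}\,\mathbb{E}_{q(x)}\left[\left\|s_\theta(x) - \nabla_x \log q(x)\right\|^2\right].$$
The issue with explicit score matching is that the kernel density estimate is a fairly poor nonparametric estimate of the true distribution, especially when samples are few or the data is high-dimensional.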
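A small sketch of explicit score matching in 1-D, assuming a Gaussian kernel with bandwidth `h` (the kernel choice and function names are illustrative, not taken from the notes). The score of the kernel density estimate has a closed form, so the loss can be estimated by sampling from the estimate itself.

```python
import math
import random


def kde_score(x, samples, h):
    """Score d/dx log q(x) of a 1-D Gaussian-kernel density estimate.

    q(x) is an equal-weight mixture of N(x_m, h^2), so its score is a
    softmax-weighted average of the per-kernel scores (x_m - x) / h^2.
    """
    logits = [-0.5 * ((x - xm) / h) ** 2 for xm in samples]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return sum(w * (xm - x) for w, xm in zip(weights, samples)) / (total * h**2)


def esm_loss(score_model, samples, h, rng, n_eval=200):
    """Monte Carlo estimate of 0.5 * E_q[(s(x) - d/dx log q(x))^2], x ~ q."""
    total = 0.0
    for _ in range(n_eval):
        # Sampling from the KDE: pick a training point, add kernel noise.
        x = rng.choice(samples) + h * rng.gauss(0.0, 1.0)
        total += 0.5 * (score_model(x) - kde_score(x, samples, h)) ** 2
    return total / n_eval
```

The loss is zero when the model reproduces the score of $q$, which is only as good as $q$ is as an approximation of the data distribution.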
1.0.2. Implicit Score Matching
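Implicit score matching removes the unknown data score from the objective via integration by parts, giving
$$J_{\mathrm{ISM}}(\theta) = \mathbb{E}_{p(x)}\left[\mathrm{Tr}\!\left(\nabla_x s_\theta(x)\right) + \frac{1}{2}\left\|s_\theta(x)\right\|^2\right].$$
If the model for the score function is realized by a deep neural network, the trace of the Jacobian $\nabla_x s_\theta(x)$ is expensive to compute (one backward pass per input dimension), so implicit score matching does not scale well.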
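To make the trace term concrete, here is a sketch for a hypothetical 1-D linear score model $s(x) = -(x - m)/v$, where the trace of the Jacobian reduces to the scalar derivative $-1/v$. The model family and names are assumptions for illustration only.

```python
def ism_loss(data, mean, var):
    """Implicit score matching loss for the 1-D model s(x) = -(x - mean) / var.

    In one dimension the trace of the Jacobian is simply ds/dx = -1 / var.
    """
    total = 0.0
    for x in data:
        s = -(x - mean) / var
        ds_dx = -1.0 / var
        total += ds_dx + 0.5 * s**2
    return total / len(data)
```

For this model the loss is minimized at the sample mean and the (population) sample variance, without ever evaluating the data score. In high dimensions the derivative becomes a full Jacobian trace, which is where the cost comes from.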
1.0.3. Denoising Score Matching
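Denoising score matching replaces the marginal score with the score of a conditional distribution $q(x \mid x')$ that perturbs a clean sample $x'$:
$$J_{\mathrm{DSM}}(\theta) = \mathbb{E}_{q(x, x')}\left[\frac{1}{2}\left\|s_\theta(x) - \nabla_x \log q(x \mid x')\right\|^2\right].$$
In the special case where $q(x \mid x') = \mathcal{N}(x;\, x', \sigma^2 I)$, then $\nabla_x \log q(x \mid x') = -\frac{x - x'}{\sigma^2} = -\frac{z}{\sigma}$, where $x = x' + \sigma z$ and $z \sim \mathcal{N}(0, I)$. So we have
$$J_{\mathrm{DSM}}(\theta) = \mathbb{E}_{q(x')}\,\mathbb{E}_{z}\left[\frac{1}{2}\left\|s_\theta(x' + \sigma z) + \frac{z}{\sigma}\right\|^2\right].$$
The beauty of this equation is that it is highly interpretable. The quantity $x' + \sigma z$ is effectively adding noise to a clean image $x'$. The score function $s_\theta$ is supposed to take this noisy image and predict the scaled, negated noise $-z/\sigma$. Predicting noise is equivalent to denoising, because any denoised image plus the predicted noise will give us the noisy observation. Therefore, this equation is a denoising step.
Up to a constant $C$ that is independent of the variable $\theta$, it holds that $J_{\mathrm{DSM}}(\theta) = J_{\mathrm{ESM}}(\theta) + C$.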
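The denoising objective can be estimated with plain Monte Carlo. The sketch below is 1-D and uses illustrative names; `score_model` stands in for $s_\theta$ and is an assumption, not a specific network.

```python
import random


def dsm_loss(score_model, clean_data, sigma, rng, n_noise=8):
    """Monte Carlo estimate of 0.5 * E[(s(x' + sigma * z) + z / sigma)^2]."""
    total, count = 0.0, 0
    for x_clean in clean_data:
        for _ in range(n_noise):
            z = rng.gauss(0.0, 1.0)
            x_noisy = x_clean + sigma * z          # corrupt the clean sample
            residual = score_model(x_noisy) + z / sigma
            total += 0.5 * residual**2
            count += 1
    return total / count
```

As a sanity check, if all clean data sits at a single point $x'$, the perturbed distribution is $\mathcal{N}(x', \sigma^2)$ and the exact score $-(x - x')/\sigma^2$ drives the loss to zero.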
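The noise conditional score network (NCSN) optimizes the following loss:
$$J_{\mathrm{NCSN}}(\theta) = \frac{1}{L}\sum_{i=1}^{L} \lambda(\sigma_i)\,\ell(\theta; \sigma_i),$$
where the individual loss function is defined according to the noise levels $\sigma_1, \ldots, \sigma_L$:
$$\ell(\theta; \sigma_i) = \frac{1}{2}\,\mathbb{E}_{p(x')}\,\mathbb{E}_{z \sim \mathcal{N}(0, I)}\left[\left\|s_\theta(x' + \sigma_i z,\, \sigma_i) + \frac{z}{\sigma_i}\right\|^2\right].$$
The coefficient function $\lambda(\sigma_i)$ is often chosen as $\lambda(\sigma) = \sigma^2$ based on empirical findings. The noise level sequence often satisfies $\frac{\sigma_1}{\sigma_2} = \cdots = \frac{\sigma_{L-1}}{\sigma_L} > 1$.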
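A short sketch of the geometric noise schedule and the $\lambda(\sigma) = \sigma^2$ weighting. The function names are illustrative; the per-level losses are assumed to be computed elsewhere (for example by a denoising score matching estimate at each $\sigma_i$).

```python
def geometric_sigmas(sigma_max, sigma_min, num_levels):
    """Noise levels with a constant ratio sigma_i / sigma_{i+1} > 1."""
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_levels - 1))
    return [sigma_max * ratio**i for i in range(num_levels)]


def ncsn_loss(per_level_losses, sigmas):
    """Average of lambda(sigma_i) * loss_i with lambda(sigma) = sigma^2."""
    weighted = [s**2 * loss for s, loss in zip(sigmas, per_level_losses)]
    return sum(weighted) / len(sigmas)
```

Since the denoising target $-z/\sigma$ has magnitude of order $1/\sigma$, unweighted losses scale roughly like $1/\sigma^2$; multiplying by $\sigma^2$ puts all noise levels on a comparable scale.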
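For inference, we use Annealed Langevin Dynamics, which runs Langevin updates at each noise level in turn, from $\sigma_1$ down to $\sigma_L$:
$$x_{t+1} = x_t + \frac{\alpha_i}{2}\, s_\theta(x_t, \sigma_i) + \sqrt{\alpha_i}\, z_t, \qquad z_t \sim \mathcal{N}(0, I),$$
where $\alpha_i$ is the step size used at noise level $\sigma_i$.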
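A toy sketch of annealed Langevin dynamics in 1-D. It assumes data distributed as $\mathcal{N}(3, 0.5^2)$ so that the noise-perturbed score is known exactly, and it uses the step-size rule $\alpha_i = \epsilon\,\sigma_i^2/\sigma_L^2$ from Song and Ermon's NCSN algorithm; the specific constants and names are illustrative assumptions.

```python
import math
import random


def perturbed_score(x, sigma, mu=3.0, s=0.5):
    """Exact score of N(mu, s^2) convolved with N(0, sigma^2) noise.

    A toy stand-in for a trained noise-conditional network s_theta(x, sigma).
    """
    return -(x - mu) / (s**2 + sigma**2)


def annealed_langevin(score, sigmas, eps, steps_per_level, rng):
    """Sweep noise levels from large to small, running Langevin at each."""
    x = rng.gauss(0.0, sigmas[0])  # broad initialization
    for sigma in sigmas:
        alpha = eps * sigma**2 / sigmas[-1] ** 2  # step size for this level
        for _ in range(steps_per_level):
            z = rng.gauss(0.0, 1.0)
            x = x + 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * z
    return x
```

Large noise levels move samples quickly toward the bulk of the data, where the score is reliable; small noise levels then refine them.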
1.1. Denoising Diffusion Probabilistic Models (DDPM)
DDPM provides an alternative perspective on score-based models through the lens of variational inference. Instead of directly learning score functions, DDPM learns to reverse a fixed noising process.
1.1.1. The Forward Process
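DDPM defines a fixed Markov chain that gradually destroys data structure: $q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\right)$. Because a composition of Gaussian steps is again Gaussian, we can sample $x_t$ directly from $x_0$: $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.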
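A minimal sketch of the closed-form forward sampling. The linear schedule from $10^{-4}$ to $0.02$ over 1000 steps is the one used in the original DDPM paper; the function names and the scalar (1-D) setting are assumptions for illustration.

```python
import math


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1, ..., beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s); list index 0 holds t = 1."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, eps):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, t in 1..T."""
    a = alpha_bars[t - 1]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps
```

Since $\bar\alpha_T$ is close to zero for this schedule, $x_T$ is nearly pure noise regardless of $x_0$.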
1.1.2. The Reverse Process
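The key insight is to learn a reverse Markov chain that undoes the forward process: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)$. Maximizing the ELBO and dropping the per-timestep weighting gives a remarkably simple objective:
$$L_{\mathrm{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t\right)\right\|^2\right].$$
The model learns to predict the noise that was added at each timestep. The noise prediction in DDPM is equivalent to score estimation up to a scale factor: $\nabla_{x_t} \log q(x_t) \approx -\epsilon_\theta(x_t, t)/\sqrt{1 - \bar\alpha_t}$. This reveals that DDPM is implicitly performing denoising score matching at multiple noise levels, just like NCSN but with a different parameterization.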
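The noise-to-score conversion can be checked on a case with a closed form. The sketch below assumes 1-D data $x_0 \sim \mathcal{N}(0, 1)$, for which $x_t \sim \mathcal{N}(0, 1)$ at every timestep and the optimal noise prediction is $\mathbb{E}[\epsilon \mid x_t] = \sqrt{1 - \bar\alpha_t}\, x_t$; the function names are illustrative.

```python
import math


def eps_to_score(eps_pred, alpha_bar_t):
    """score(x_t) = -eps / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / math.sqrt(1.0 - alpha_bar_t)


def score_to_eps(score, alpha_bar_t):
    """Inverse conversion: eps = -sqrt(1 - alpha_bar_t) * score."""
    return -math.sqrt(1.0 - alpha_bar_t) * score


def optimal_eps_standard_normal_data(x_t, alpha_bar_t):
    """E[eps | x_t] when x_0 ~ N(0, 1).

    Then x_t ~ N(0, 1) and Cov(eps, x_t) = sqrt(1 - alpha_bar_t).
    """
    return math.sqrt(1.0 - alpha_bar_t) * x_t
```

Converting the optimal noise prediction gives $-x_t$, which is exactly the score of $\mathcal{N}(0, 1)$.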
1.2. Stochastic Differential Equation (SDE)
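- Ordinary differential equation (ODE): $\frac{dx(t)}{dt} = f(x, t)$. Assuming the initial condition $x(0) = x_0$, the solution is $x(t) = x_0 + \int_0^t f(x(s), s)\, ds$. The differential form is $dx = f(x, t)\, dt$.
- Stochastic differential equation (SDE): $\frac{dx(t)}{dt} = f(x, t) + g(t)\, \xi(t)$, where $\xi(t) \sim \mathcal{N}(0, I)$ is a white-noise term. We can define $dw = \xi(t)\, dt$, then $dx = f(x, t)\, dt + g(t)\, dw$.
Forward Diffusion: $dx = f(x, t)\, dt + g(t)\, dw$.
Reverse Diffusion: $dx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar{w}$, where $p_t(x)$ is the probability distribution of $x$ at time $t$ and $\bar{w}$ is the Wiener process when time flows backward. Compared to Langevin dynamics, this gives a more general framework that is continuous in time. By solving the estimated reverse SDE with numerical SDE solvers, we can simulate the reverse stochastic process for sample generation.
The forward sampling equation of DDPM can be written as $dx = -\frac{\beta(t)}{2}\, x\, dt + \sqrt{\beta(t)}\, dw$. The reverse sampling equation of DDPM can be written as $dx = -\beta(t)\left[\frac{x}{2} + \nabla_x \log p_t(x)\right] dt + \sqrt{\beta(t)}\, d\bar{w}$. This is the Variance Preserving (VP) SDE.
The forward sampling equation of SMLD can be written as $dx = \sqrt{\frac{d[\sigma(t)^2]}{dt}}\, dw$. The reverse sampling equation of SMLD can be written as $dx = -\frac{d[\sigma(t)^2]}{dt}\, \nabla_x \log p_t(x)\, dt + \sqrt{\frac{d[\sigma(t)^2]}{dt}}\, d\bar{w}$. This is the Variance Exploding (VE) SDE.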
It's possible to convert any SDE into an ordinary differential equation (ODE) without changing its marginal distributions.
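The corresponding ODE of an SDE is named the probability flow ODE, given by
$$dx = \left[f(x, t) - \frac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)\right] dt,$$
where $f$ and $g$ are the drift and diffusion coefficients of the forward SDE and $p_t$ is the marginal distribution at time $t$.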
This is a completely deterministic evolution equation, yet it preserves the probabilistic structure of the original SDE.
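A sketch of the probability flow ODE on a toy problem where everything is known in closed form. It assumes a VP SDE with constant $\beta(t) = 4$ on $t \in [0, 1]$ and data $\mathcal{N}(2, 0.5^2)$, so the marginal $p_t$ stays Gaussian and its score is exact; the constants, names, and the plain Euler integrator are illustrative choices.

```python
import math

BETA = 4.0           # constant beta(t) for a toy VP SDE on t in [0, 1]
MU0, S0 = 2.0, 0.5   # toy data distribution N(MU0, S0^2)


def marginal(t):
    """Mean and variance of p_t when x_0 ~ N(MU0, S0^2) under the VP SDE."""
    decay = math.exp(-BETA * t)
    return MU0 * math.sqrt(decay), S0**2 * decay + (1.0 - decay)


def score(x, t):
    """Exact score of the Gaussian marginal p_t."""
    mean, var = marginal(t)
    return -(x - mean) / var


def probability_flow_ode(x_T, n_steps=2000):
    """Euler-integrate dx/dt = f - 0.5 * g^2 * score from t = 1 down to t = 0."""
    dt = 1.0 / n_steps
    x = x_T
    for i in range(n_steps):
        t = 1.0 - i * dt
        drift = -0.5 * BETA * x - 0.5 * BETA * score(x, t)
        x = x - drift * dt  # step backward in time
    return x
```

For Gaussian marginals the flow maps a point one standard deviation above the mean of $p_1$ to a point one standard deviation above the mean of $p_0$, with no randomness involved.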
Predictor-Corrector Samplers: The predictor step uses numerical ODE or SDE solvers to advance along the reverse-time trajectory. This step provides the "big picture" direction, essentially telling us where we should move next based on our learned score function. However, numerical solvers accumulate errors over time, and our score function estimates are imperfect, leading to gradual drift away from the true data distribution.
The corrector step addresses these imperfections by applying a few iterations of Langevin dynamics at the current time point. Since we know what the distribution should look like at any given time during the reverse process, we can use MCMC sampling to "correct" our current sample to better match that target distribution. This local refinement helps counteract the accumulated errors from the predictor steps.
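A compact sketch of a predictor-corrector sampler in 1-D, under the same kind of toy assumptions: a VP SDE with constant $\beta(t) = 4$ and data $\mathcal{N}(2, 0.5^2)$, so the exact score substitutes for a learned one. The predictor is one Euler-Maruyama step of the reverse SDE and the corrector is one Langevin step; step counts and step sizes are illustrative, not tuned.

```python
import math
import random

BETA = 4.0           # constant beta(t) for a toy VP SDE on t in [0, 1]
MU0, S0 = 2.0, 0.5   # toy data distribution N(MU0, S0^2)


def score(x, t):
    """Exact score of p_t for Gaussian data under the VP SDE."""
    decay = math.exp(-BETA * t)
    mean = MU0 * math.sqrt(decay)
    var = S0**2 * decay + (1.0 - decay)
    return -(x - mean) / var


def pc_sample(rng, n_steps=200, corrector_steps=1, corrector_eps=0.005):
    """Draw one sample with alternating predictor and corrector updates."""
    dt = 1.0 / n_steps
    x = rng.gauss(0.0, 1.0)  # approximate prior at t = 1
    for i in range(n_steps):
        t = 1.0 - i * dt
        # Predictor: Euler-Maruyama step of the reverse-time SDE.
        reverse_drift = -0.5 * BETA * x - BETA * score(x, t)
        x = x - reverse_drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
        # Corrector: Langevin steps targeting the marginal at the new time.
        t_next = t - dt
        for _ in range(corrector_steps):
            noise = rng.gauss(0.0, 1.0)
            x = x + corrector_eps * score(x, t_next) + math.sqrt(2.0 * corrector_eps) * noise
    return x
```

Setting `corrector_steps=0` recovers a plain reverse-SDE sampler, which makes it easy to compare the two on the same score function.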
1.3. Guidance Methods for Controllable Generation
1.3.1. Classifier Guidance
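Classifier guidance enables conditional generation by incorporating a pre-trained classifier during sampling. The key insight is that we can decompose the conditional score function using Bayes' rule (the term $\log p(y)$ does not depend on $x_t$, so its gradient vanishes):
$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t),$$
where:
- $\nabla_{x_t} \log p(x_t)$ is the unconditional score function (learned by the diffusion model)
- $\nabla_{x_t} \log p(y \mid x_t)$ is the gradient of the classifier's log probability
The modified reverse SDE becomes:
$$dx = \left[f(x, t) - g(t)^2\left(\nabla_x \log p_t(x) + \omega\, \nabla_x \log p_t(y \mid x)\right)\right] dt + g(t)\, d\bar{w},$$
where $\omega$ is the guidance scale that controls the strength of conditioning.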
Implementation Details:
- Train a classifier on noisy images at various noise levels
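- During sampling, compute classifier gradients: $\nabla_{x_t} \log p_\phi(y \mid x_t)$, where $p_\phi$ is the noise-aware classifier
- Scale these gradients by the guidance scale $\omega$ and add them to the unconditional score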
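The decomposition can be verified on a toy case. The sketch below assumes a 1-D two-class mixture $0.5\,\mathcal{N}(-2, 1) + 0.5\,\mathcal{N}(2, 1)$, for which the exact classifier is $p(y{=}1 \mid x) = \mathrm{sigmoid}(4x)$; the mixture and function names are illustrative, and a real system would use a learned score and a trained noise-aware classifier.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def unconditional_score(x):
    """Score of the mixture 0.5 * N(-2, 1) + 0.5 * N(2, 1)."""
    r = sigmoid(4.0 * x)  # posterior weight of the component centered at +2
    return r * (-(x - 2.0)) + (1.0 - r) * (-(x + 2.0))


def classifier_log_prob_grad(x, y):
    """d/dx log p(y | x) for the exact classifier p(y=1 | x) = sigmoid(4x)."""
    p1 = sigmoid(4.0 * x)
    return 4.0 * (1.0 - p1) if y == 1 else -4.0 * p1


def guided_score(x, y, scale):
    """Unconditional score plus the scaled classifier gradient."""
    return unconditional_score(x) + scale * classifier_log_prob_grad(x, y)
```

With a scale of 1 the guided score equals the class-conditional score, for example $-(x - 2)$ for $y = 1$; larger scales push samples further toward regions the classifier is confident about.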
1.3.2. CLIP Guidance
CLIP guidance is a strategy that replaces the standard classifier with a CLIP model to steer the diffusion process towards a text caption.
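In particular, we perturb the reverse-process mean with the gradient of the dot product of the image and caption encodings with respect to the image:
$$\hat\mu_\theta(x_t \mid c) = \mu_\theta(x_t \mid c) + \omega\, \Sigma_\theta(x_t \mid c)\, \nabla_{x_t}\big(f(x_t) \cdot g(c)\big),$$
where $\Sigma_\theta(x_t \mid c)$ is the variance, $f$ and $g$ are the CLIP image and caption encoders, and $\omega$ is the guidance scale.
Similar to classifier guidance, CLIP must be trained on noised images to obtain the correct gradient in the reverse process. The GLIDE authors use CLIP models that were explicitly trained to be noise-aware, which they refer to as noised CLIP models.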
1.3.3. Classifier-Free Guidance
Classifier-free guidance achieves conditional generation without requiring a separate classifier. Instead, it trains a single diffusion model to handle both conditional and unconditional generation.
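Training Procedure: During training, randomly drop the conditioning information with probability $p_{\mathrm{uncond}}$ (typically 10-20%), replacing it with a null token $\varnothing$:
$$\tilde{c} = \begin{cases} \varnothing & \text{with probability } p_{\mathrm{uncond}} \\ c & \text{otherwise} \end{cases}, \qquad L(\theta) = \mathbb{E}\left[\left\|\epsilon - \epsilon_\theta(x_t, t, \tilde{c})\right\|^2\right].$$
Sampling Procedure: The classifier-free guidance formula combines conditional and unconditional predictions:
$$\tilde\epsilon_\theta(x_t, t, c) = (1 + \omega)\, \epsilon_\theta(x_t, t, c) - \omega\, \epsilon_\theta(x_t, t, \varnothing),$$
where $\omega$ is the guidance scale.
This can be rewritten as:
$$\tilde\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, c) + \omega\left(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\right).$$
Connection to Classifier Guidance: Classifier-free guidance implicitly learns the classifier gradient term:
$$\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t) \approx -\frac{1}{\sqrt{1 - \bar\alpha_t}}\left(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\right).$$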
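A minimal sketch of the two pieces of classifier-free guidance: condition dropout at training time and the combination rule at sampling time. Noise predictions are represented as plain lists of floats, and `NULL_CONDITION`, the function names, and the use of `None` as the null token are hypothetical choices for illustration.

```python
import random

NULL_CONDITION = None  # hypothetical stand-in for the "no condition" token


def maybe_drop_condition(condition, p_uncond, rng):
    """Training-time dropout: use the null token with probability p_uncond."""
    return NULL_CONDITION if rng.random() < p_uncond else condition


def cfg_combine(eps_cond, eps_uncond, scale):
    """(1 + w) * eps_cond - w * eps_uncond, applied elementwise."""
    return [(1.0 + scale) * c - scale * u for c, u in zip(eps_cond, eps_uncond)]


def cfg_combine_delta(eps_cond, eps_uncond, scale):
    """Equivalent form: eps_cond + w * (eps_cond - eps_uncond)."""
    return [c + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

A scale of 0 returns the purely conditional prediction; increasing it extrapolates away from the unconditional prediction. Each sampling step needs two forward passes of the same network, one with the condition and one with the null token.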
1.4. Denoising Diffusion Implicit Models (DDIM)
While DDPM achieves excellent generation quality, its sampling process requires many steps (typically 1000), leading to slow inference. DDIM enables faster sampling by introducing a non-Markovian sampling process.
The key insight of DDIM is that given the marginal distributions of the forward process, there exist infinitely many reverse processes that can produce the same marginals. DDPM is just one special case (Markovian process).
1.4.1. Non-Markovian Forward Process
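DDIM defines a more general forward process:
$$q_\sigma(x_{1:T} \mid x_0) = q_\sigma(x_T \mid x_0) \prod_{t=2}^{T} q_\sigma(x_{t-1} \mid x_t, x_0),$$
where the conditional distribution is:
$$q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2} \cdot \frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{\sqrt{1 - \bar\alpha_t}},\; \sigma_t^2 I\right).$$
Here $\sigma_t$ is an adjustable parameter:
- When $\sigma_t = \sqrt{\frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}} \sqrt{1 - \frac{\bar\alpha_t}{\bar\alpha_{t-1}}}$, it reduces to DDPM
- When $\sigma_t = 0$, it becomes a completely deterministic process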
1.4.2. DDIM Sampling Formula
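The DDIM reverse sampling process is:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}} \left(\frac{x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right) + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\; \epsilon_\theta(x_t, t) + \sigma_t\, \epsilon_t,$$
where the first term is the predicted $x_0$ rescaled to step $t-1$, the second term points back toward $x_t$, and $\sigma_t \epsilon_t$ with $\epsilon_t \sim \mathcal{N}(0, I)$ provides optional randomness.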
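A sketch of a single DDIM update in 1-D. The parameter `eta` interpolates between the deterministic case (`eta = 0`) and the DDPM-like case (`eta = 1`), following the parameterization $\sigma_t(\eta)$ from the DDIM paper; the function names and the explicit `noise` argument are illustrative choices.

```python
import math


def ddim_sigma(alpha_bar_t, alpha_bar_prev, eta):
    """sigma_t = eta * sqrt((1 - a_prev) / (1 - a_t)) * sqrt(1 - a_t / a_prev)."""
    ratio = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return eta * math.sqrt(ratio) * math.sqrt(1.0 - alpha_bar_t / alpha_bar_prev)


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta=0.0, noise=0.0):
    """One reverse step from alpha_bar_t to alpha_bar_prev."""
    sigma = ddim_sigma(alpha_bar_t, alpha_bar_prev, eta)
    # Predicted clean sample implied by the current noise estimate.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Deterministic component pointing back toward x_t.
    direction = math.sqrt(1.0 - alpha_bar_prev - sigma**2) * eps_pred
    return math.sqrt(alpha_bar_prev) * x0_pred + direction + sigma * noise
```

If the noise prediction is exact and `eta = 0`, the step lands on the point with the same $x_0$ and the same $\epsilon$ at the lower noise level.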
1.4.3. Accelerated Sampling
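DDIM's most important contribution is skip-step sampling. We can define an increasing subsequence $\tau = (\tau_1, \ldots, \tau_S)$ of $\{1, \ldots, T\}$ where $S \ll T$, then use:
$$x_{\tau_{i-1}} = \sqrt{\bar\alpha_{\tau_{i-1}}} \left(\frac{x_{\tau_i} - \sqrt{1 - \bar\alpha_{\tau_i}}\, \epsilon_\theta(x_{\tau_i}, \tau_i)}{\sqrt{\bar\alpha_{\tau_i}}}\right) + \sqrt{1 - \bar\alpha_{\tau_{i-1}} - \sigma_{\tau_i}^2}\; \epsilon_\theta(x_{\tau_i}, \tau_i) + \sigma_{\tau_i}\, \epsilon.$$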
This reduces 1000-step sampling to 50 steps or fewer, dramatically improving inference speed.
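A sketch of skip-step sampling with $\sigma = 0$ in 1-D. It assumes the linear DDPM schedule over 1000 training steps and an evenly spaced subsequence of 50 steps; `eps_model` is a hypothetical callable standing in for a trained network, and the helper names are illustrative.

```python
import math


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar indexed by timestep: entry 0 is 1.0, entry t is alpha_bar_t."""
    step = (beta_end - beta_start) / (num_steps - 1)
    alpha_bars, running = [1.0], 1.0
    for i in range(num_steps):
        running *= 1.0 - (beta_start + i * step)
        alpha_bars.append(running)
    return alpha_bars


def timestep_subsequence(num_train_steps, num_sample_steps):
    """Evenly spaced timesteps tau_1 < ... < tau_S ending at T."""
    stride = num_train_steps // num_sample_steps
    return list(range(stride, num_train_steps + 1, stride))


def ddim_sample(eps_model, x_T, alpha_bars, taus):
    """Deterministic DDIM sampling (sigma = 0) along the subsequence taus."""
    x = x_T
    for i in reversed(range(len(taus))):
        t = taus[i]
        t_prev = taus[i - 1] if i > 0 else 0
        a_t, a_prev = alpha_bars[t], alpha_bars[t_prev]
        eps = eps_model(x, t)
        x0_pred = (x - math.sqrt(1.0 - a_t) * eps) / math.sqrt(a_t)
        x = math.sqrt(a_prev) * x0_pred + math.sqrt(1.0 - a_prev) * eps
    return x
```

The loop evaluates the network only at the selected timesteps, so the cost scales with the length of the subsequence rather than with $T$.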
When $\sigma_t = 0$, DDIM becomes a completely deterministic process, which provides several important advantages:
- Semantic interpolation: Meaningful interpolation in latent space
- Reconstruction capability: The same initial noise always reconstructs the same image
- Controllable generation: Facilitates various conditional generation tasks
1.4.4. Connections to Other Methods
- Relation to Probability Flow ODE: DDIM's deterministic sampling actually approximates solving the probability flow ODE
- Unification with DDPM: DDIM can be viewed as a generalization of DDPM under different parameters
- Quality vs Diversity Trade-off: Smaller $\sigma_t$ tends to give better sample quality when few sampling steps are used, but reduces diversity