Score-Based Models

June 9, 2025

by Leonardo

1. Use score-based models

Score-Matching Langevin Dynamics (SMLD)

Langevin equation: $𝑥_{𝑡 + 1} = 𝑥_{𝑡} + 𝜏 \nabla_{𝑥} \log 𝑝 (𝑥_{𝑡}) + \sqrt{2 𝜏} 𝑧, 𝑧 \sim 𝒩 (0, 𝐼)$

Stein's score function: $𝑠_{𝜃} (𝑥) = \nabla_{𝑥} \log 𝑝_{𝜃} (𝑥)$ . Do not confuse this with the ordinary score function $𝑠_{𝑋} (𝜃) = \nabla_{𝜃} \log 𝑝_{𝜃} (𝑥)$ , which differentiates with respect to the parameters rather than the data. For example, if $𝑝 (𝑥)$ is a Gaussian distribution $𝒩 (𝜇, 𝜎^{2})$ , then $𝑠 (𝑥) = - \frac{𝑥 - 𝜇}{𝜎^{2}}$ . Because the gradient with respect to $𝑥$ removes the normalizing constant, we don't need any special architectures to make that constant tractable.
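As a minimal sketch of the Langevin update, the snippet below samples from a one-dimensional Gaussian using its analytic score. The step size `tau`, the number of steps, and the target parameters are illustrative assumptions, not values from the text.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): grad_x log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def langevin_sample(score_fn, x_init, tau=0.01, n_steps=2000, rng=None):
    """Iterate x_{t+1} = x_t + tau * score(x_t) + sqrt(2 * tau) * z."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + tau * score_fn(x) + np.sqrt(2.0 * tau) * z
    return x

rng = np.random.default_rng(0)
mu, sigma = 3.0, 0.5                         # assumed toy target N(3, 0.5^2)
x_init = rng.uniform(-5.0, 5.0, size=20000)  # arbitrary starting points
samples = langevin_sample(lambda x: gaussian_score(x, mu, sigma), x_init, rng=rng)
print(samples.mean(), samples.std())
```

With a small enough `tau`, the empirical mean and standard deviation should land close to `mu` and `sigma`, even though the chain starts from a uniform distribution.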

We minimize the Fisher divergence between the model and the data distribution:

𝐸_{𝑥 \sim 𝑝 (𝑥)} [{‖ 𝑠_{𝜃} (𝑥) - \nabla_{𝑥} \log 𝑝 (𝑥) ‖}_{2}^{2}]

The key challenge is the fact that the estimated score functions are inaccurate in low density regions, where few data points are available for computing the score matching objective.

1.0.1. Explicit Score Matching

Consider the classical kernel density estimation by defining a distribution $𝑞_{ℎ} (𝑥) = \frac{1}{𝑀} \sum_{𝑚 = 1}^{𝑀} \frac{1}{ℎ} 𝐾 (\frac{𝑥 - 𝑥^{𝑚}}{ℎ})$ , where $ℎ$ is some hyperparameter for the kernel function $𝐾$ and $𝑥^{𝑚}$ is the $𝑚$ -th sample in the training set.

This $𝑞_{ℎ} (𝑥)$ is a smooth approximation of the true $𝑝 (𝑥)$ , which is never known, so we learn $𝑠_{𝜃} (𝑥)$ from $𝑞_{ℎ} (𝑥)$ instead.

The explicit score matching loss is

\begin{matrix} 𝐽_{ESM} (𝜃) & = \frac{1}{2} 𝐸_{𝑥 \sim 𝑝 (𝑥)} [{‖ 𝑠_{𝜃} (𝑥) - \nabla_{𝑥} \log 𝑝 (𝑥) ‖}^{2}] \\ & \approx \frac{1}{2} 𝐸_{𝑥 \sim 𝑞_{ℎ} (𝑥)} [{‖ 𝑠_{𝜃} (𝑥) - \nabla_{𝑥} \log 𝑞_{ℎ} (𝑥) ‖}^{2}] \\ & = \frac{1}{2} \int {‖ 𝑠_{𝜃} (𝑥) - \nabla_{𝑥} \log 𝑞_{ℎ} (𝑥) ‖}^{2} 𝑞_{ℎ} (𝑥) 𝑑 𝑥 \\ & = \frac{1}{2} \frac{1}{𝑀} \sum_{𝑚 = 1}^{𝑀} \int {‖ 𝑠_{𝜃} (𝑥) - \nabla_{𝑥} \log 𝑞_{ℎ} (𝑥) ‖}^{2} \frac{1}{ℎ} 𝐾 (\frac{𝑥 - 𝑥^{𝑚}}{ℎ}) 𝑑 𝑥 \end{matrix}

The issue with explicit score matching is that kernel density estimation is a fairly poor non-parametric estimate of the true distribution.

1.0.2. Implicit Score Matching

\begin{matrix} 𝐽_{ISM} (𝜃) & = 𝐸_{𝑥 \sim 𝑝 (𝑥)} [Tr (\nabla_{𝑥} 𝑠_{𝜃} (𝑥)) + \frac{1}{2} {‖ 𝑠_{𝜃} (𝑥) ‖}^{2}] \\ & \approx \frac{1}{𝑀} \sum_{𝑚 = 1}^{𝑀} \sum_{𝑖} (\partial_{𝑖} {[𝑠_{𝜃} (𝑥^{𝑚})]}_{𝑖} + \frac{1}{2} {| {[𝑠_{𝜃} (𝑥^{𝑚})]}_{𝑖} |}^{2}) \end{matrix}

If the score function is modeled by a deep neural network, the trace of its Jacobian is expensive to compute, since it needs roughly one backward pass per input dimension. This makes implicit score matching hard to scale.

1.0.3. Denoising Score Matching

𝐽_{DSM} (𝜃) = 𝐸_{(𝑥, 𝑥^{'}) \sim 𝑞 (𝑥, 𝑥^{'})} [\frac{1}{2} {‖ 𝑠_{𝜃} (𝑥) - \nabla_{𝑥} \log 𝑞 (𝑥 | 𝑥^{'}) ‖}^{2}]

In the special case where $𝑞 (𝑥 | 𝑥^{'}) = 𝒩 (𝑥 | 𝑥^{'}, 𝜎^{2} 𝐼)$ , we can write $𝑥 = 𝑥^{'} + 𝜎 𝑧$ with $𝑧 \sim 𝒩 (0, 𝐼)$ , so $\nabla_{𝑥} \log 𝑞 (𝑥 | 𝑥^{'}) = - \frac{𝑥 - 𝑥^{'}}{𝜎^{2}} = - \frac{𝑧}{𝜎}$ . Renaming the clean sample $𝑥^{'}$ as $𝑥$ , we have

𝐽_{DSM} (𝜃) = 𝐸_{𝑥 \sim 𝑝 (𝑥), 𝑧 \sim 𝒩 (0, 𝐼)} [\frac{1}{2} {‖ 𝑠_{𝜃} (𝑥 + 𝜎 𝑧) + \frac{𝑧}{𝜎} ‖}^{2}]

The beauty of this equation is that it is highly interpretable. The quantity $𝑥 + 𝜎 𝑧$ adds noise $𝜎 𝑧$ to a clean image $𝑥$ . The score function $𝑠_{𝜃}$ takes this noisy image and predicts the negative scaled noise $- \frac{𝑧}{𝜎}$ . Predicting the noise is equivalent to denoising, because the denoised image plus the predicted noise gives back the noisy observation. Minimizing this loss therefore trains a denoiser.

Up to a constant $𝐶$ that is independent of $𝜃$ , it holds that $𝐽_{DSM} (𝜃) = 𝐽_{ESM} (𝜃) + 𝐶$ .
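To make the denoising objective concrete, here is a small sketch that fits a linear score model $s_a(x) = a x$ by minimizing a Monte Carlo estimate of the DSM loss. The toy data distribution $𝒩 (0, 1)$, the noise level, and the linear model are assumptions chosen so the answer is known in closed form: the noisy data follow $𝒩 (0, 1 + 𝜎^{2})$, whose score is $- x / (1 + 𝜎^{2})$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                         # assumed noise level
n = 200_000

x_clean = rng.standard_normal(n)    # toy data drawn from N(0, 1)
z = rng.standard_normal(n)
x_noisy = x_clean + sigma * z       # x + sigma * z
target = -z / sigma                 # what s_theta(x + sigma z) should match

def dsm_loss(a):
    """Monte Carlo estimate of E[ 0.5 * || a * (x + sigma z) + z / sigma ||^2 ]."""
    return 0.5 * np.mean((a * x_noisy + z / sigma) ** 2)

# The loss is quadratic in a, so the minimizer is a least-squares solution.
a_hat = np.sum(x_noisy * target) / np.sum(x_noisy**2)

# Score of the noise-perturbed distribution N(0, 1 + sigma^2) is a_true * x.
a_true = -1.0 / (1.0 + sigma**2)
print(a_hat, a_true)
```

The fitted slope matches the score of the perturbed distribution rather than the score of the clean data, which is why small noise levels are needed to approximate the true score.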

The noise conditioned score network (NCSN) optimizes the following loss:

𝐽_{NCSN} (𝜃) = \frac{1}{𝐿} \sum_{𝑖 = 1}^{𝐿} 𝜆 (𝜎_{𝑖}) 𝓁 (𝜃; 𝜎_{𝑖})

where the individual loss function is defined according to the noise levels $𝜎_{1}, \dots, 𝜎_{𝐿}$ :

𝓁 (𝜃; 𝜎) = 𝐸_{𝑥 \sim 𝑝 (𝑥), 𝑧 \sim 𝒩 (0, 𝐼)} [\frac{1}{2} {‖ 𝑠_{𝜃} (𝑥 + 𝜎 𝑧, 𝜎) + \frac{𝑧}{𝜎} ‖}^{2}]

The coefficient function is often chosen as $𝜆 (𝜎) = 𝜎^{2}$ based on empirical findings. The noise level sequence often satisfies $\frac{𝜎_{1}}{𝜎_{2}} = \frac{𝜎_{2}}{𝜎_{3}} = \dots = \frac{𝜎_{𝐿 - 1}}{𝜎_{𝐿}} > 1$ .
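A geometric noise schedule with these properties can be sketched in a few lines. The endpoints `sigma_max = 1.0`, `sigma_min = 0.01` and the number of levels are illustrative assumptions.

```python
import numpy as np

def geometric_sigmas(sigma_max, sigma_min, num_levels):
    """Noise levels sigma_1 > ... > sigma_L with a constant ratio between neighbors."""
    return np.geomspace(sigma_max, sigma_min, num_levels)

sigmas = geometric_sigmas(1.0, 0.01, 10)   # assumed endpoints and L = 10
ratios = sigmas[:-1] / sigmas[1:]          # sigma_i / sigma_{i+1}, all equal and > 1
weights = sigmas**2                        # lambda(sigma) = sigma^2

print(sigmas)
print(ratios)
```

With $𝜆 (𝜎) = 𝜎^{2}$ the weighted residual $𝜎 𝑠_{𝜃} + 𝑧$ has the same order of magnitude at every noise level, so no single level dominates the sum.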

For inference, we use Annealed Langevin Dynamics:

𝑥_{𝑡 + 1} = 𝑥_{𝑡} + \frac{𝛼_{𝑖}}{2} 𝑠_{𝜃} (𝑥_{𝑡}, 𝜎_{𝑖}) + \sqrt{𝛼_{𝑖}} 𝑧_{𝑡}, 𝑧_{𝑡} \sim 𝒩 (0, 𝐼)

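where $𝛼_{𝑖} = 𝜂 \frac{𝜎_{𝑖}^{2}}{𝜎_{𝐿}^{2}}$ is the step size at noise level $𝜎_{𝑖}$ and $𝜂 > 0$ is a small base step size. The chain runs for several steps at each noise level, from $𝜎_{1}$ (largest) down to $𝜎_{𝐿}$ (smallest).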
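The annealing loop can be sketched as follows, using an analytic score in place of a trained network. The toy data distribution $𝒩 (𝜇, 𝑠^{2})$, the `base_step` factor multiplying $𝜎_{𝑖}^{2} / 𝜎_{𝐿}^{2}$ (included here to keep the discretization stable), and the number of steps per level are all assumptions made for illustration.

```python
import numpy as np

def annealed_langevin(score_fn, x_init, sigmas, base_step=1e-4,
                      steps_per_level=100, rng=None):
    """Run Langevin dynamics at each noise level, from the largest sigma to the smallest."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x_init, dtype=float)
    for sigma_i in sigmas:
        # Step size proportional to sigma_i^2 / sigma_L^2; base_step is an assumed constant.
        alpha_i = base_step * sigma_i**2 / sigmas[-1]**2
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * alpha_i * score_fn(x, sigma_i) + np.sqrt(alpha_i) * z
    return x

mu, s = 2.0, 0.3   # assumed toy data distribution N(mu, s^2)

def toy_score(x, sigma):
    """Exact score of the data perturbed with noise level sigma: N(mu, s^2 + sigma^2)."""
    return -(x - mu) / (s**2 + sigma**2)

rng = np.random.default_rng(0)
sigmas = np.geomspace(1.0, 0.01, 10)
x_init = rng.uniform(-3.0, 3.0, size=5000)
samples = annealed_langevin(toy_score, x_init, sigmas, rng=rng)
print(samples.mean(), samples.std())
```

The early, high-noise levels move samples quickly toward the right region, while the later levels with small steps refine them toward the data distribution.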

1.1. Denoising Diffusion Probabilistic Models (DDPM)

DDPM provides an alternative perspective on score-based models through the lens of variational inference. Instead of directly learning score functions, DDPM learns to reverse a fixed noising process.

1.1.1. The Forward Process

DDPM defines a fixed Markov chain that gradually destroys data structure: $𝑞 (𝑥_{𝑡} | 𝑥_{𝑡 - 1}) = 𝒩 (𝑥_{𝑡}; \sqrt{1 - 𝛽_{𝑡}} 𝑥_{𝑡 - 1}, 𝛽_{𝑡} 𝐼)$ . Through clever reparameterization, we can sample $𝑥_{𝑡}$ directly: $𝑥_{𝑡} = \sqrt{{\bar{𝛼}}_{𝑡}} 𝑥_{0} + \sqrt{1 - {\bar{𝛼}}_{𝑡}} 𝜀, 𝜀 \sim 𝒩 (0, 𝐼)$ where $𝛼_{𝑡} = 1 - 𝛽_{𝑡}$ and ${\bar{𝛼}}_{𝑡} = \prod_{𝑖 = 1}^{𝑡} 𝛼_{𝑖}$ .
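The closed-form sampling shortcut can be checked numerically against the step-by-step Markov chain. In this sketch the linear schedule from `1e-4` to `0.02` with `T = 1000` is an assumed, commonly used choice, and the constant toy input is purely illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bars[t - 1] holds \bar{alpha}_t

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed as in the text."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

def q_sample_iterative(x0, t, rng):
    """Apply q(x_i | x_{i-1}) for i = 1..t, one step at a time."""
    x = x0.copy()
    for i in range(t):
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(alphas[i]) * x + np.sqrt(betas[i]) * noise
    return x

rng = np.random.default_rng(0)
x0 = np.full(50_000, 2.0)            # toy "dataset": every sample equals 2.0
t = 300
x_direct, _ = q_sample(x0, t, rng)
x_iter = q_sample_iterative(x0, t, rng)
print(x_direct.mean(), x_iter.mean(), x_direct.std(), x_iter.std())
```

Both routes should give the same mean $\sqrt{{\bar{𝛼}}_{𝑡}} 𝑥_{0}$ and standard deviation $\sqrt{1 - {\bar{𝛼}}_{𝑡}}$ up to sampling error, which is what makes training at a random timestep cheap.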

1.1.2. The Reverse Process

The key insight is to learn a reverse Markov chain that undoes the forward process: $𝑝_{𝜃} (𝑥_{𝑡 - 1} | 𝑥_{𝑡}) = 𝒩 (𝑥_{𝑡 - 1}; 𝜇_{𝜃} (𝑥_{𝑡}, 𝑡), 𝜎_{𝑡}^{2} 𝐼)$ . By maximizing the ELBO, we arrive at a remarkably simple objective:

ℒ = 𝐸_{𝑡, 𝑥_{0}, 𝜀_{0}} [𝐸_{𝑞 (𝑥_{𝑡} | 𝑥_{0})} [\frac{1}{2 𝜎_{𝑞}^{2} (𝑡)} \frac{{(1 - 𝛼_{𝑡})}^{2}}{(1 - {\bar{𝛼}}_{𝑡}) 𝛼_{𝑡}} {‖ 𝜀_{𝜃} (\sqrt{{\bar{𝛼}}_{𝑡}} 𝑥_{0} + \sqrt{1 - {\bar{𝛼}}_{𝑡}} 𝜀_{0}) - 𝜀_{0} ‖}^{2}]]

Dropping the time-dependent weight gives the simplified objective:

ℒ_{simple} = 𝐸_{𝑡, 𝑥_{0}, 𝜀} [{‖ 𝜀 - 𝜀_{𝜃} (𝑥_{𝑡}) ‖}^{2}]

The model learns to predict the noise $𝜀$ that was added at each timestep. The noise prediction in DDPM is equivalent to score estimation: $𝑠_{𝜃} (𝑥_{𝑡}, 𝑡) = - \frac{𝜀_{𝜃} (𝑥_{𝑡}, 𝑡)}{\sqrt{1 - {\bar{𝛼}}_{𝑡}}}$ . This reveals that DDPM is implicitly performing denoising score matching at multiple noise levels, just like NCSN but with a different parameterization.
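The equivalence follows in two lines from the closed-form forward marginal $𝑞 (𝑥_{𝑡} | 𝑥_{0})$ . The short derivation below uses only the reparameterization already stated in the text; it is written in plain LaTeX for readability.

```latex
% Forward marginal: q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I),
% with x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps.
\begin{aligned}
\nabla_{x_t} \log q(x_t \mid x_0)
  &= -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} \\
  &= -\frac{\sqrt{1 - \bar{\alpha}_t}\, \varepsilon}{1 - \bar{\alpha}_t}
   = -\frac{\varepsilon}{\sqrt{1 - \bar{\alpha}_t}}
\end{aligned}
```

The denoising score matching target at timestep $𝑡$ is therefore a rescaled copy of the noise, so a network trained to regress $𝜀$ yields the score after division by $- \sqrt{1 - {\bar{𝛼}}_{𝑡}}$ .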

1.2. Stochastic Differential Equation (SDE)

Ordinary differential equation (ODE): $\frac{𝑑 𝑥 (𝑡)}{𝑑 𝑡} = 𝑓 (𝑡, 𝑥)$ . Assuming the initial condition $𝑥 (0) = 𝑥_{0}$ , the solution is $𝑥 (𝑡) = 𝑥_{0} + \int_{0}^{𝑡} 𝑓 (𝑠, 𝑥 (𝑠)) 𝑑 𝑠$ . The differential form is $𝑑 𝑥 (𝑡) = 𝑓 (𝑡, 𝑥 (𝑡)) 𝑑 𝑡$ .

Stochastic differential equation (SDE): $\frac{𝑑 𝑥 (𝑡)}{𝑑 𝑡} = 𝑓 (𝑡, 𝑥) + 𝑔 (𝑡, 𝑥) 𝜉 (𝑡)$ where $𝜉 (𝑡) \sim 𝒩 (0, 𝐼)$ is white noise. Defining $𝑑 𝑤 = 𝜉 (𝑡) 𝑑 𝑡$ gives $𝑑 𝑥 = 𝑓 (𝑡, 𝑥) 𝑑 𝑡 + 𝑔 (𝑡, 𝑥) 𝑑 𝑤$ .

Forward Diffusion: $𝑑 𝑥 = 𝑓 (𝑥, 𝑡) 𝑑 𝑡 + 𝑔 (𝑡) 𝑑 𝑤$ .

Reverse Diffusion: $𝑑 𝑥 = [𝑓 (𝑥, 𝑡) - {𝑔 (𝑡)}^{2} \nabla_{𝑥} \log 𝑝_{𝑡} (𝑥)] 𝑑 𝑡 + 𝑔 (𝑡) 𝑑 \bar{𝑤}$ where $𝑝_{𝑡} (𝑥)$ is the probability distribution of $𝑥$ at time $𝑡$ and $\bar{𝑤}$ is the Wiener process when time flows backward. Compared to the Langevin Dynamics, this gives us a more general framework and is time-continuous. By solving the estimated reverse SDE with numerical SDE solvers, we can simulate the reverse stochastic process for sample generation.

The forward sampling equation of DDPM can be written as $𝑑 𝑥 = - \frac{𝛽 (𝑡)}{2} 𝑥 𝑑 𝑡 + \sqrt{𝛽 (𝑡)} 𝑑 𝑤$ . The reverse sampling equation of DDPM can be written as $𝑑 𝑥 = - 𝛽 (𝑡) [\frac{𝑥}{2} + \nabla_{𝑥} \log 𝑝_{𝑡} (𝑥)] 𝑑 𝑡 + \sqrt{𝛽 (𝑡)} 𝑑 \bar{𝑤}$ . This is Variance Preserving (VP) SDE.

The forward sampling equation of SMLD can be written as $𝑑 𝑥 = \sqrt{\frac{𝑑 [{𝜎 (𝑡)}^{2}]}{𝑑 𝑡}} 𝑑 𝑤$ . The reverse sampling equation of SMLD can be written as $𝑑 𝑥 = - \frac{𝑑 [{𝜎 (𝑡)}^{2}]}{𝑑 𝑡} \nabla_{𝑥} \log 𝑝_{𝑡} (𝑥) 𝑑 𝑡 + \sqrt{\frac{𝑑 [{𝜎 (𝑡)}^{2}]}{𝑑 𝑡}} 𝑑 \bar{𝑤}$ . This is Variance Exploding (VE) SDE.
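The names "variance preserving" and "variance exploding" can be seen directly by simulating both forward SDEs with the Euler-Maruyama scheme. This is a sketch under assumed schedules: a linear $𝛽 (𝑡)$ from `0.1` to `20` for the VP case and a geometric $𝜎 (𝑡)$ from `0.01` to `10` for the VE case, both on $𝑡 \in [0, 1]$ .

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, t_end=1.0, n_steps=1000, rng=None):
    """Simulate dx = drift(x, t) dt + diffusion(t) dw from t = 0 to t_end."""
    rng = np.random.default_rng() if rng is None else rng
    dt = t_end / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        dw = np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + drift(x, t) * dt + diffusion(t) * dw
    return x

# VP SDE: dx = -beta(t)/2 x dt + sqrt(beta(t)) dw, with an assumed linear beta(t).
beta_min, beta_max = 0.1, 20.0
def beta(t):
    return beta_min + t * (beta_max - beta_min)

# VE SDE: dx = sqrt(d[sigma(t)^2]/dt) dw, with an assumed geometric sigma(t).
sigma_min, sigma_max = 0.01, 10.0
def ve_diffusion(t):
    sigma_t = sigma_min * (sigma_max / sigma_min) ** t
    # For this sigma(t): d[sigma(t)^2]/dt = 2 * sigma(t)^2 * log(sigma_max / sigma_min).
    return sigma_t * np.sqrt(2.0 * np.log(sigma_max / sigma_min))

rng = np.random.default_rng(0)
x0 = np.full(20_000, 3.0)   # all mass at x = 3 at time 0
x_vp = euler_maruyama(x0, lambda x, t: -0.5 * beta(t) * x,
                      lambda t: np.sqrt(beta(t)), rng=rng)
x_ve = euler_maruyama(x0, lambda x, t: 0.0 * x, ve_diffusion, rng=rng)
print(x_vp.mean(), x_vp.std())   # near 0 and 1
print(x_ve.mean(), x_ve.std())   # mean stays near 3, std near sigma_max
```

In the VP case the signal is shrunk toward zero while the variance settles at one; in the VE case the mean is untouched and the variance grows to roughly $𝜎_{max}^{2}$ .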

It's possible to convert any SDE into an ordinary differential equation (ODE) without changing its marginal distributions.

The corresponding ODE of an SDE is named probability flow ODE, given by

𝑑 𝑥 = [𝑓 (𝑥, 𝑡) - \frac{1}{2} {𝑔 (𝑡)}^{2} \nabla_{𝑥} \log 𝑝_{𝑡} (𝑥)] 𝑑 𝑡

This is a completely deterministic evolution equation, yet it preserves the probabilistic structure of the original SDE.
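A small sketch can illustrate this for a case where the score is known exactly. It assumes a VP SDE with constant $𝛽$ and Gaussian data $𝒩 (𝜇_{0}, 𝑠_{0}^{2})$ , so the marginal at time $𝑡$ is Gaussian with mean $𝜇_{0} 𝑒^{- 𝛽 𝑡 / 2}$ and variance $1 + (𝑠_{0}^{2} - 1) 𝑒^{- 𝛽 𝑡}$ . The constants and the plain Euler solver are illustrative choices.

```python
import numpy as np

beta, T = 2.0, 5.0      # assumed constant beta and time horizon
mu0, s0 = 1.5, 0.2      # assumed data distribution N(mu0, s0^2)

def mean_t(t):
    return mu0 * np.exp(-0.5 * beta * t)

def var_t(t):
    return 1.0 + (s0**2 - 1.0) * np.exp(-beta * t)

def score(x, t):
    """Exact score of the Gaussian marginal p_t."""
    return -(x - mean_t(t)) / var_t(t)

def pf_ode_velocity(x, t):
    """f(x, t) - 0.5 * g(t)^2 * score, with f = -beta/2 x and g^2 = beta."""
    return -0.5 * beta * x - 0.5 * beta * score(x, t)

def solve_backward(x_T, n_steps=2000):
    """Euler integration of the probability flow ODE from t = T down to t = 0."""
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for k in range(n_steps):
        t = T - k * dt
        x = x - dt * pf_ode_velocity(x, t)
    return x

rng = np.random.default_rng(0)
x_T = mean_t(T) + np.sqrt(var_t(T)) * rng.standard_normal(10_000)
x_0 = solve_backward(x_T)
print(x_0.mean(), x_0.std())   # close to mu0 and s0
```

No noise is injected during the backward pass, so the only randomness is in the draw of `x_T`; the same `x_T` always maps to the same `x_0`.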

Predictor-Corrector Samplers: The predictor step uses numerical ODE or SDE solvers to advance along the reverse-time trajectory. This step provides the "big picture" direction, essentially telling us where we should move next based on our learned score function. However, numerical solvers accumulate errors over time, and our score function estimates are imperfect, leading to gradual drift away from the true data distribution.

The corrector step addresses these imperfections by applying a few iterations of Langevin dynamics at the current time point. Since we know what the distribution should look like at any given time during the reverse process, we can use MCMC sampling to "correct" our current sample to better match that target distribution. This local refinement helps counteract the accumulated errors from the predictor steps.

Consistency Model

1.3. Guidance Methods for Controllable Generation

1.3.1. Classifier Guidance

Classifier guidance enables conditional generation by incorporating a pre-trained classifier during sampling. The key insight is that we can decompose the conditional score function using Bayes' rule:

\nabla_{𝑥} \log 𝑝_{𝑡} (𝑥_{𝑡} | 𝑦) = \nabla_{𝑥} \log 𝑝_{𝑡} (𝑥_{𝑡}) + \nabla_{𝑥} \log 𝑝 (𝑦 | 𝑥_{𝑡})

where:

$\nabla_{𝑥} \log 𝑝_{𝑡} (𝑥_{𝑡})$ is the unconditional score function (learned by the diffusion model)
$\nabla_{𝑥} \log 𝑝 (𝑦 | 𝑥_{𝑡})$ is the gradient of the classifier's log probability

The modified reverse SDE becomes:

𝑑 𝑥 = [𝑓 (𝑥, 𝑡) - {𝑔 (𝑡)}^{2} (\nabla_{𝑥} \log 𝑝_{𝑡} (𝑥) + 𝛾 \nabla_{𝑥} \log 𝑝 (𝑦 | 𝑥))] 𝑑 𝑡 + 𝑔 (𝑡) 𝑑 \bar{𝑤}

where $𝛾$ is the guidance scale that controls the strength of conditioning.

Implementation Details:

Train a classifier $𝑝_{𝜑} (𝑦 | 𝑥_{𝑡}, 𝑡)$ on noisy images at various noise levels
During sampling, compute classifier gradients: $\nabla_{𝑥} \log 𝑝_{𝜑} (𝑦 | 𝑥_{𝑡}, 𝑡)$
Scale these gradients by $𝛾$ and add to the unconditional score
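The Bayes decomposition can be verified on a toy problem where every term is available in closed form. The sketch below assumes a single fixed noise level and a two-component mixture $0.5 𝒩 (- 𝑚, 1) + 0.5 𝒩 (𝑚, 1)$ , with class $𝑦 = 1$ for the component at $+ 𝑚$ and $𝑦 = 0$ for the one at $- 𝑚$ ; the exact posterior stands in for a trained classifier.

```python
import numpy as np

m = 2.0   # assumed half-distance between the two class means

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def unconditional_score(x):
    """grad_x log p(x) for the mixture 0.5 N(-m, 1) + 0.5 N(m, 1)."""
    return -x + m * np.tanh(m * x)

def classifier_log_prob_grad(x, y):
    """grad_x log p(y | x), where p(y = 1 | x) = sigmoid(2 m x)."""
    sign = 1.0 if y == 1 else -1.0
    return sign * 2.0 * m * sigmoid(-sign * 2.0 * m * x)

def guided_score(x, y, gamma=1.0):
    """Unconditional score plus gamma times the classifier gradient."""
    return unconditional_score(x) + gamma * classifier_log_prob_grad(x, y)

x = np.linspace(-3.0, 3.0, 13)
print(np.allclose(guided_score(x, 1), -(x - m)))   # matches the score of N(m, 1)
print(guided_score(0.0, 1, gamma=3.0))             # larger gamma pushes harder toward class 1
```

With $𝛾 = 1$ the guided score equals the exact conditional score. Values of $𝛾$ above one over-weight the classifier term, sharpening samples toward the class at the cost of diversity.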

1.3.2. CLIP Guidance

CLIP guidance is a strategy that replaces the standard classifier with a CLIP model to steer the diffusion process towards a text caption.

In particular, we perturb the reverse-process mean with the gradient of the dot product of the image and caption encodings with respect to the image:

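{\hat{𝜇}}_{𝜃} (𝑥_{𝑡} | 𝑦) = 𝜇_{𝜃} (𝑥_{𝑡} | 𝑦) + 𝑠 \cdot \Sigma_{𝜃} (𝑥_{𝑡} | 𝑦) \nabla_{𝑥_{𝑡}} (𝑓 (𝑥_{𝑡}) \cdot 𝑔 (𝑦))

where $\Sigma_{𝜃} (𝑥_{𝑡} | 𝑦)$ is the variance of the reverse process, $𝑠$ is the guidance scale, and $𝑓$ and $𝑔$ are the CLIP image and caption encoders.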

Similar to classifier guidance, CLIP must be trained on noised images $𝑥_{𝑡}$ to obtain the correct gradient in the reverse process. CLIP models explicitly trained to be noise-aware in this way are referred to as noised CLIP models.

1.3.3. Classifier-Free Guidance

Classifier-free guidance achieves conditional generation without requiring a separate classifier. Instead, it trains a single diffusion model to handle both conditional and unconditional generation.

Training Procedure: During training, randomly drop the conditioning information with probability $𝑝_{uncond}$ (typically 10-20%):

\text{network prediction} = \begin{cases} 𝜀_{𝜃} (𝑥_{𝑡}, 𝑡, 𝑦) & \text{with probability } 1 - 𝑝_{uncond} \\ 𝜀_{𝜃} (𝑥_{𝑡}, 𝑡, \emptyset) & \text{with probability } 𝑝_{uncond} \end{cases}

Sampling Procedure: The classifier-free guidance formula combines conditional and unconditional predictions:

{\tilde{𝜀}}_{𝜃} (𝑥_{𝑡}, 𝑡, 𝑦) = 𝜀_{𝜃} (𝑥_{𝑡}, 𝑡, \emptyset) + 𝛾 (𝜀_{𝜃} (𝑥_{𝑡}, 𝑡, 𝑦) - 𝜀_{𝜃} (𝑥_{𝑡}, 𝑡, \emptyset))

This can be rewritten as:

{\tilde{𝜀}}_{𝜃} (𝑥_{𝑡}, 𝑡, 𝑦) = (1 - 𝛾) 𝜀_{𝜃} (𝑥_{𝑡}, 𝑡, \emptyset) + 𝛾 𝜀_{𝜃} (𝑥_{𝑡}, 𝑡, 𝑦)
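Both halves of the procedure are short enough to sketch directly. The helper names, the integer class labels, and the use of `-1` as the null token $\emptyset$ are hypothetical choices for illustration; a real model would define its own null embedding.

```python
import numpy as np

def drop_condition(labels, p_uncond, null_label, rng):
    """Training time: replace each label by the null label with probability p_uncond."""
    mask = rng.random(labels.shape) < p_uncond
    return np.where(mask, null_label, labels)

def cfg_combine(eps_cond, eps_uncond, gamma):
    """Sampling time: eps_uncond + gamma * (eps_cond - eps_uncond)."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=100_000)
dropped = drop_condition(labels, p_uncond=0.1, null_label=-1, rng=rng)
print((dropped == -1).mean())            # close to 0.1

eps_c = rng.standard_normal(4)           # stand-ins for the two network outputs
eps_u = rng.standard_normal(4)
print(cfg_combine(eps_c, eps_u, gamma=3.0))
```

Setting `gamma = 0` recovers the unconditional prediction and `gamma = 1` the conditional one; values above one extrapolate away from the unconditional prediction.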

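Connection to Classifier Guidance: Using $𝜀_{𝜃} = - \sqrt{1 - {\bar{𝛼}}_{𝑡}} 𝑠_{𝜃}$ together with Bayes' rule, the difference between the conditional and unconditional predictions acts as an implicit classifier gradient:

𝜀_{𝜃} (𝑥_{𝑡}, 𝑡, 𝑦) - 𝜀_{𝜃} (𝑥_{𝑡}, 𝑡, \emptyset) \approx - \sqrt{1 - {\bar{𝛼}}_{𝑡}} \nabla_{𝑥} \log 𝑝 (𝑦 | 𝑥_{𝑡})

Scaling this difference by $𝛾$ therefore plays the same role as the guidance scale in classifier guidance.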

1.4. Denoising Diffusion Implicit Models (DDIM)

While DDPM achieves excellent generation quality, its sampling process requires many steps (typically 1000), leading to slow inference. DDIM enables faster sampling by introducing a non-Markovian sampling process.

The key insight of DDIM is that given the marginal distributions of the forward process, there exist infinitely many reverse processes that can produce the same marginals. DDPM is just one special case (Markovian process).

1.4.1. Non-Markovian Forward Process

DDIM defines a more general forward process:

𝑞_{𝜎} (𝑥_{1 : 𝑇} | 𝑥_{0}) = 𝑞_{𝜎} (𝑥_{𝑇} | 𝑥_{0}) \prod_{𝑡 = 2}^{𝑇} 𝑞_{𝜎} (𝑥_{𝑡 - 1} | 𝑥_{𝑡}, 𝑥_{0})

where the conditional distribution is:

𝑞_{𝜎} (𝑥_{𝑡 - 1} | 𝑥_{𝑡}, 𝑥_{0}) = 𝒩 (\sqrt{{\bar{𝛼}}_{𝑡 - 1}} 𝑥_{0} + \sqrt{1 - {\bar{𝛼}}_{𝑡 - 1} - 𝜎_{𝑡}^{2}} \cdot \frac{𝑥_{𝑡} - \sqrt{{\bar{𝛼}}_{𝑡}} 𝑥_{0}}{\sqrt{1 - {\bar{𝛼}}_{𝑡}}}, 𝜎_{𝑡}^{2} 𝐼)

Here $𝜎_{𝑡}$ is an adjustable parameter:

When $𝜎_{𝑡}^{2} = {\tilde{𝛽}}_{𝑡} = \frac{1 - {\bar{𝛼}}_{𝑡 - 1}}{1 - {\bar{𝛼}}_{𝑡}} 𝛽_{𝑡}$ , it reduces to DDPM
When $𝜎_{𝑡} = 0$ , it becomes a completely deterministic process

1.4.2. DDIM Sampling Formula

The DDIM reverse sampling process is:

𝑥_{𝑡 - 1} = \sqrt{{\bar{𝛼}}_{𝑡 - 1}} \underbrace{(\frac{𝑥_{𝑡} - \sqrt{1 - {\bar{𝛼}}_{𝑡}} 𝜀_{𝜃}^{(𝑡)} (𝑥_{𝑡})}{\sqrt{{\bar{𝛼}}_{𝑡}}})}_{\text{predicted } 𝑥_{0}} + \sqrt{1 - {\bar{𝛼}}_{𝑡 - 1} - 𝜎_{𝑡}^{2}} \cdot 𝜀_{𝜃}^{(𝑡)} (𝑥_{𝑡}) + 𝜎_{𝑡} 𝜀_{𝑡}

where $𝜀_{𝑡} \sim 𝒩 (0, 𝐼)$ provides optional randomness.
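A single update of this form can be sketched as a small function. The names `ddim_step` and `ddpm_sigma` are hypothetical, and `eps_pred` stands in for the network output $𝜀_{𝜃}^{(𝑡)} (𝑥_{𝑡})$ .

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev, sigma_t=0.0, rng=None):
    """One reverse step x_t -> x_{t-1} given the predicted noise eps_pred."""
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    direction = np.sqrt(1.0 - a_bar_prev - sigma_t**2) * eps_pred
    noise = 0.0
    if sigma_t > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        noise = sigma_t * rng.standard_normal(np.shape(x_t))
    return np.sqrt(a_bar_prev) * x0_pred + direction + noise

def ddpm_sigma(a_bar_t, a_bar_prev):
    """The sigma_t that recovers DDPM: sqrt((1 - abar_{t-1}) / (1 - abar_t) * beta_t)."""
    beta_t = 1.0 - a_bar_t / a_bar_prev
    return np.sqrt((1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
a_bar_t, a_bar_prev = 0.5, 0.7                       # illustrative values
x_t = np.sqrt(a_bar_t) * x0 + np.sqrt(1.0 - a_bar_t) * eps
x_prev = ddim_step(x_t, eps, a_bar_t, a_bar_prev)    # deterministic, sigma_t = 0
print(x_prev)
```

When the predicted noise equals the true noise and $𝜎_{𝑡} = 0$ , the step lands exactly on $\sqrt{{\bar{𝛼}}_{𝑡 - 1}} 𝑥_{0} + \sqrt{1 - {\bar{𝛼}}_{𝑡 - 1}} 𝜀$ , the same noise realization viewed at the previous noise level.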

1.4.3. Accelerated Sampling

DDIM's most important contribution is skip-step sampling. We can define a subsequence $\{𝑡_{1}, 𝑡_{2}, \dots, 𝑡_{𝑆}\} \subset \{1, 2, \dots, 𝑇\}$ where $𝑆 ≪ 𝑇$ , then use the deterministic ( $𝜎_{𝑡} = 0$ ) update:

𝑥_{𝑡_{𝑖 - 1}} = \sqrt{{\bar{𝛼}}_{𝑡_{𝑖 - 1}}} (\frac{𝑥_{𝑡_{𝑖}} - \sqrt{1 - {\bar{𝛼}}_{𝑡_{𝑖}}} 𝜀_{𝜃}^{(𝑡_{𝑖})} (𝑥_{𝑡_{𝑖}})}{\sqrt{{\bar{𝛼}}_{𝑡_{𝑖}}}}) + \sqrt{1 - {\bar{𝛼}}_{𝑡_{𝑖 - 1}}} \cdot 𝜀_{𝜃}^{(𝑡_{𝑖})} (𝑥_{𝑡_{𝑖}})

This reduces 1000-step sampling to 50 steps or fewer, dramatically improving inference speed.
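The skip-step loop can be sketched end to end with an oracle in place of a trained network. The assumptions here are a linear schedule with `T = 1000`, an evenly spaced subsequence of 50 timesteps, and data concentrated on a single point `x_star`, for which the ideal noise prediction has a closed form.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
# alpha_bars[t] holds \bar{alpha}_t, with alpha_bars[0] = 1 for the clean image.
alpha_bars = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def make_subsequence(T, S):
    """Roughly S evenly spaced timesteps in {1, ..., T}, in decreasing order."""
    return np.unique(np.linspace(1, T, S).round().astype(int))[::-1]

def ddim_sample(eps_model, x_T, timesteps):
    """Deterministic DDIM sampling (sigma_t = 0) over a timestep subsequence."""
    x = np.array(x_T, dtype=float)
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        a_t, a_prev = alpha_bars[t], alpha_bars[t_prev]
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

x_star = np.array([1.0, -2.0, 0.5])   # toy dataset consisting of one point

def oracle_eps(x, t):
    """Ideal noise prediction when all data mass sits at x_star."""
    a_t = alpha_bars[t]
    return (x - np.sqrt(a_t) * x_star) / np.sqrt(1.0 - a_t)

rng = np.random.default_rng(0)
x_T = rng.standard_normal(3)
timesteps = make_subsequence(T, 50)
x_gen = ddim_sample(oracle_eps, x_T, timesteps)
print(len(timesteps), x_gen)
```

With a perfect noise predictor the 50-step run recovers `x_star`; with a learned network, fewer steps trade some sample quality for speed.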

When $𝜎_{𝑡} = 0$ , DDIM becomes a completely deterministic process, which provides several important advantages:

Semantic interpolation: Meaningful interpolation in latent space
Reconstruction capability: The same initial noise always reconstructs the same image
Controllable generation: Facilitates various conditional generation tasks

1.4.4. Connections to Other Methods

Relation to Probability Flow ODE: DDIM's deterministic sampling actually approximates solving the probability flow ODE
Unification with DDPM: DDIM can be viewed as a generalization of DDPM under different $𝜎$ parameters
Quality vs Diversity Trade-off: Smaller $𝜎$ provides better sample quality but reduces diversity
