Consistency Models

June 27, 2025

by Leonardo

1. Consistency Models

While diffusion models achieve remarkable sample quality, they require hundreds or thousands of function evaluations during generation, making them computationally expensive for real-time applications.

The key insight behind Consistency Models is to learn a direct mapping from noise to data, enabling single-step generation while preserving the high-quality outputs characteristic of diffusion models.

1.1. The Core Intuition

The fundamental idea of Consistency Models stems from a simple yet powerful observation: if we can learn to map any point on a diffusion trajectory directly to its corresponding clean data point, we can bypass the iterative denoising process entirely.

Consider the probability flow ODE that underlies diffusion models:

𝑑 𝒙 = [𝒇 (𝒙, 𝑡) - \frac{1}{2} {𝑔 (𝑡)}^{2} \nabla_{𝒙} \log 𝑝_{𝑡} (𝒙)] 𝑑 𝑡

This ODE defines a deterministic trajectory from noise $𝒙_{𝑇} \sim 𝒩 (𝟎, 𝐼)$ to data $𝒙_{0} \sim 𝑝_{data}$ . A Consistency Model learns a function $𝑓_{𝜃} : (𝒙_{𝑡}, 𝑡) \mapsto 𝒙_{0}$ that maps any point $(𝒙_{𝑡}, 𝑡)$ on this trajectory to the trajectory's origin $𝒙_{0}$ .

1.1.1. Self-Consistency Property

The defining characteristic of Consistency Models is the self-consistency property:

Definition: A function $𝑓 : (𝒙, 𝑡) \mapsto 𝒙$ satisfies the self-consistency property if:

𝑓 (𝒙_{𝑡}, 𝑡) = 𝑓 (𝒙_{𝑡^{'}}, 𝑡^{'}) \quad \forall \, 𝒙_{𝑡}, 𝒙_{𝑡^{'}} \text{ on the same ODE trajectory}

In other words, for any two points on the same trajectory, the consistency function should map both to the same endpoint. This ensures that:

$𝑓 (𝒙_{0}, 0) = 𝒙_{0}$ (identity mapping at the boundary)
$𝑓 (𝒙_{𝑡}, 𝑡) = 𝒙_{0}$ for all $𝑡 > 0$ (consistent prediction)

1.2. Mathematical Framework

1.2.1. Parameterization

To ensure the boundary condition $𝑓 (𝒙_{0}, 0) = 𝒙_{0}$ , Consistency Models use a specific parameterization:

𝑓_{𝜃} (𝒙, 𝑡) = \begin{cases} 𝒙 & \text{if } 𝑡 = 0 \\ 𝐹_{𝜃} (𝒙, 𝑡) & \text{if } 𝑡 > 0 \end{cases}

where $𝐹_{𝜃}$ is a deep neural network. In practice, this is implemented as:

𝑓_{𝜃} (𝒙, 𝑡) = 𝑐_{skip} (𝑡) 𝒙 + 𝑐_{out} (𝑡) 𝐹_{𝜃} (𝒙, 𝑡)

with $𝑐_{skip} (0) = 1$ , $𝑐_{out} (0) = 0$ , ensuring the boundary condition is satisfied.
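The skip-connection form can be sketched in a few lines. The coefficient formulas below are an assumption (an EDM-style choice with an illustrative `SIGMA_DATA`); the text only requires that the skip coefficient is 1 and the output coefficient is 0 at time zero. `toy_network` is a hypothetical stand-in for the free-form network.

```python
import math

SIGMA_DATA = 0.5  # assumed data standard deviation (illustrative)

def c_skip(t):
    # Equals 1 at t = 0 and decays toward 0 as t grows.
    return SIGMA_DATA**2 / (t**2 + SIGMA_DATA**2)

def c_out(t):
    # Equals 0 at t = 0, so the network output is ignored at the boundary.
    return SIGMA_DATA * t / math.sqrt(t**2 + SIGMA_DATA**2)

def toy_network(x, t):
    # Hypothetical stand-in for the network F_theta.
    return [math.tanh(xi) + t for xi in x]

def consistency_fn(x, t, network=toy_network):
    # f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t)
    out = network(x, t)
    return [c_skip(t) * xi + c_out(t) * oi for xi, oi in zip(x, out)]
```

With these coefficients, `consistency_fn(x, 0.0)` returns `x` unchanged regardless of what the network outputs, which is the boundary condition by construction.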

1.2.2. Discretized Training

For computational tractability, the continuous time interval $[0, 𝑇]$ is discretized into $𝑁$ timesteps:

0 = 𝑡_{0} < 𝑡_{1} < 𝑡_{2} < \dots < 𝑡_{𝑁 - 1} < 𝑡_{𝑁} = 𝑇

The consistency function is trained to satisfy:

𝑓_{𝜃} (𝒙_{𝑡_{𝑛}}, 𝑡_{𝑛}) = 𝑓_{𝜃} (𝒙_{𝑡_{𝑛 + 1}}, 𝑡_{𝑛 + 1})

for consecutive timesteps on the same trajectory.

1.3. Training Methods

There are two primary approaches to training Consistency Models, each with distinct advantages and trade-offs.

1.3.1. Consistency Distillation (CD)

Consistency Distillation leverages a pre-trained diffusion model (teacher) to generate training pairs for the consistency model (student).

Algorithm:

Sample $𝒙 \sim 𝑝_{data}$ and $𝑛 \sim Uniform (1, 𝑁 - 1)$
Generate noisy sample: $𝒙_{𝑡_{𝑛 + 1}} = 𝒙 + 𝑡_{𝑛 + 1} 𝒛$ where $𝒛 \sim 𝒩 (𝟎, 𝐼)$
Take one ODE-solver step with the pre-trained score model to estimate: ${\hat{𝒙}}_{𝑡_{𝑛}}^{𝜑} = 𝒙_{𝑡_{𝑛 + 1}} + (𝑡_{𝑛} - 𝑡_{𝑛 + 1}) Φ (𝒙_{𝑡_{𝑛 + 1}}, 𝑡_{𝑛 + 1}; 𝜑)$
Minimize the consistency loss:

ℒ_{CD}^{𝑁} (𝜃, 𝜃^{-}; 𝜑) = 𝐸 [𝜆 (𝑡_{𝑛}) 𝑑 (𝑓_{𝜃} (𝒙_{𝑡_{𝑛 + 1}}, 𝑡_{𝑛 + 1}), 𝑓_{𝜃^{-}} ({\hat{𝒙}}_{𝑡_{𝑛}}^{𝜑}, 𝑡_{𝑛}))]

where:

$Φ$ is the update function of a one-step ODE solver applied to the PF ODE, and $𝜑$ denotes the parameters of the pre-trained score model
$𝜃^{-}$ is an exponential moving average of $𝜃$ : $𝜃^{-} \leftarrow stopgrad (𝜇 𝜃^{-} + (1 - 𝜇) 𝜃)$
$𝜆 (𝑡)$ is a positive weighting function, typically $𝜆 (𝑡) = 1$
$𝑑 (\cdot, \cdot)$ is a distance metric (e.g., $ℓ_{2}$ or LPIPS (learned perceptual image patch similarity))
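The CD steps can be sketched on a toy problem where everything is known in closed form. This is an illustration under stated assumptions, not the paper's setup: the data is 1-D Gaussian with an assumed standard deviation `SIGMA_DATA`, the exact score stands in for the pre-trained teacher, the solver is a plain Euler step, the distance is squared error, the weight is 1, and the uniform time grid is arbitrary.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed std of the toy 1-D Gaussian data

def score(x, t):
    # Exact score of p_t = N(0, SIGMA_DATA^2 + t^2); stands in for the teacher.
    return -x / (SIGMA_DATA**2 + t**2)

def euler_phi(x, t):
    # Euler update direction for the PF ODE dx/dt = -t * score(x, t).
    return -t * score(x, t)

def exact_consistency_fn(x, t):
    # Closed-form map from (x_t, t) back to t = 0 for this toy problem.
    return x * SIGMA_DATA / math.sqrt(SIGMA_DATA**2 + t**2)

def cd_loss(f_student, f_target, timesteps, num_samples=1000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        n = rng.randrange(len(timesteps) - 1)        # index of t_n
        t_n, t_next = timesteps[n], timesteps[n + 1]
        x = rng.gauss(0.0, SIGMA_DATA)                # x ~ p_data
        x_next = x + t_next * rng.gauss(0.0, 1.0)     # noisy sample at t_{n+1}
        # One teacher solver step from t_{n+1} back to t_n.
        x_hat = x_next + (t_n - t_next) * euler_phi(x_next, t_next)
        total += (f_student(x_next, t_next) - f_target(x_hat, t_n)) ** 2
    return total / num_samples

timesteps = [0.01 * i for i in range(1, 202)]  # uniform grid, illustrative only
```

Passing `exact_consistency_fn` as both student and target leaves only the Euler discretization error in the loss, so the value is close to zero; any other function, such as the identity, gives a larger loss.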

1.3.2. Consistency Training (CT)

Consistency Training learns consistency models from scratch without relying on pre-trained diffusion models.

Algorithm:

Sample $𝒙 \sim 𝑝_{data}$ , $𝑛 \sim Uniform (1, 𝑁 - 1)$ , and $𝒛 \sim 𝒩 (𝟎, 𝐼)$
Define adjacent points on trajectory:
- $𝒙_{𝑡_{𝑛 + 1}} = 𝒙 + 𝑡_{𝑛 + 1} 𝒛$
- $𝒙_{𝑡_{𝑛}} = 𝒙 + 𝑡_{𝑛} 𝒛$
Minimize the consistency loss:

ℒ_{CT}^{𝑁} (𝜃, 𝜃^{-}) = 𝐸 [𝜆 (𝑡_{𝑛}) 𝑑 (𝑓_{𝜃} (𝒙_{𝑡_{𝑛 + 1}}, 𝑡_{𝑛 + 1}), 𝑓_{𝜃^{-}} (𝒙_{𝑡_{𝑛}}, 𝑡_{𝑛}))]
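A minimal sketch of how one CT training pair and its loss term could be formed. The function names are hypothetical, the distance is plain squared error, and `f_student` and `f_target` are any callables taking a point and a time.

```python
def ct_pair(x, z, t_n, t_next):
    # Both points reuse the same noise z, so no teacher model or ODE solver
    # is needed to relate them.
    x_n = [xi + t_n * zi for xi, zi in zip(x, z)]
    x_next = [xi + t_next * zi for xi, zi in zip(x, z)]
    return x_n, x_next

def squared_l2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def ct_loss_single(f_student, f_target, x, z, t_n, t_next, weight=1.0):
    # weight plays the role of lambda(t_n); squared_l2 plays the role of d.
    x_n, x_next = ct_pair(x, z, t_n, t_next)
    return weight * squared_l2(f_student(x_next, t_next), f_target(x_n, t_n))
```

The only difference from the distillation variant is where the point at the earlier time comes from: here it is built directly from the data sample and the shared noise.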

The paper's experiments report the following findings (the solver comparison applies to CD, since CT does not use an ODE solver):

Heun ODE solver works better than Euler's first-order solver, since higher order ODE solvers have smaller estimation errors with the same $𝑁$ .
Among different options of the distance metric function $𝑑 (\cdot, \cdot)$ , the LPIPS metric works better than $ℓ_{1}$ and $ℓ_{2}$ distance.
Smaller $𝑁$ leads to faster convergence but worse samples, whereas larger $𝑁$ leads to slower convergence but better samples upon convergence.

1.3.3. Target Network Updates

Both training methods employ a target network $𝑓_{𝜃^{-}}$ to stabilize training. The target parameters are updated via exponential moving average:

𝜃^{-} \leftarrow stopgrad (𝜇 𝜃^{-} + (1 - 𝜇) 𝜃)

where $𝜇 \in [0, 1)$ is the decay rate. This prevents the consistency loss from becoming degenerate (where the model simply outputs the same value for all inputs).
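The update rule can be sketched on plain lists of floats. This is only an illustration; a real implementation would apply the same rule to every parameter tensor and exclude the target from gradient computation.

```python
def ema_update(target_params, online_params, mu):
    # theta_minus <- mu * theta_minus + (1 - mu) * theta, elementwise.
    # Plain floats carry no gradient, which mirrors the stopgrad in the formula.
    return [mu * tm + (1.0 - mu) * th
            for tm, th in zip(target_params, online_params)]
```

With `mu = 0.0` the target becomes a copy of the online parameters, and values close to 1 make the target change slowly.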

1.4. Sampling and Inference

1.4.1. Single-Step Sampling

The most straightforward sampling approach is single-step generation:

{\hat{𝒙}}_{𝜀} = 𝑓_{𝜃} ({\hat{𝒙}}_{𝑇}, 𝑇), \quad \text{where } {\hat{𝒙}}_{𝑇} \sim 𝒩 (𝟎, 𝑇^{2} 𝐼)

The key insight is that $𝑓_{𝜃}$ has learned to map any point on the diffusion trajectory back to the clean data, so we can start from pure noise and get the final result immediately.

1.4.2. Multi-Step Sampling

While single-step sampling is fast, multi-step sampling can improve quality by alternating denoising and noise injection steps.

This process allows trading computational cost for sample quality, similar to how diffusion models work but with far fewer steps.
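Both sampling modes can be sketched with a toy consistency function. The assumptions are that the data is 1-D Gaussian with an assumed standard deviation `SIGMA_DATA`, so the exact consistency function is known, and that re-noising adds noise of level tau directly, with the boundary at time zero as in the text above. The noise levels in `taus` are arbitrary illustrative values.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed std of the toy 1-D Gaussian data

def toy_consistency_fn(x, t):
    # Exact consistency function when p_data = N(0, SIGMA_DATA^2).
    return x * SIGMA_DATA / math.sqrt(SIGMA_DATA**2 + t**2)

def multistep_sample(f, T, taus, rng):
    # Single-step sample: draw x_T ~ N(0, T^2) and map it straight to data.
    x = f(rng.gauss(0.0, T), T)
    # Optional refinement: re-noise to level tau, then denoise again.
    for tau in taus:  # decreasing noise levels below T
        x_tau = x + tau * rng.gauss(0.0, 1.0)
        x = f(x_tau, tau)
    return x

rng = random.Random(0)
one_step = [multistep_sample(toy_consistency_fn, 80.0, [], rng)
            for _ in range(20000)]
three_step = [multistep_sample(toy_consistency_fn, 80.0, [10.0, 1.0], rng)
              for _ in range(20000)]
```

An empty `taus` list gives single-step generation; each extra entry costs one more function evaluation. With the exact toy function both variants already match the data distribution, so the benefit of extra steps only shows up when the learned function is imperfect.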

1.5. Improved Consistency Training (iCT)

Removing EMA for the Teacher Network: The authors identified a theoretical flaw where using Exponential Moving Average (EMA) for the teacher network provides no useful training signal for the data distribution. By setting the EMA decay rate to zero ( $𝜇 (𝑘) = 0$ ), they ensure the teacher and student parameters are correctly aligned, which significantly boosts performance.
Pseudo-Huber Loss: To replace the biased and computationally expensive LPIPS metric, the authors adopt the Pseudo-Huber loss, defined as $𝑑 (𝑥, 𝑦) = \sqrt{{‖ 𝑥 - 𝑦 ‖}_{2}^{2} + 𝑐^{2}} - 𝑐$ . This simple metric is robust to outliers, reduces training variance, and ultimately surpasses the performance of LPIPS-based training.
Improved Discretization Curriculum: The paper demonstrates that model performance scales with the number of discretization steps ( $𝑁$ ) according to a power law. Based on this, they propose a new exponential curriculum for $𝑁 (𝑘)$ (doubling $𝑁$ at fixed intervals), which empirically yields the best sample quality.
Lognormal Noise Schedule: The default training procedure over-emphasizes high noise levels. The authors introduce a lognormal noise schedule to focus training on more critical low-to-mid noise ranges, leading to a notable improvement in sample quality.
Optimized Hyperparameters: The work also introduces an improved weighting function ( $𝜆 (𝜎_{𝑖}) = \frac{1}{𝜎_{𝑖 + 1} - 𝜎_{𝑖}}$ ), increased dropout rates, and fine-tuned noise embeddings to further enhance performance.
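The Pseudo-Huber distance and the weighting function from the list above are short enough to write out directly. The function names are hypothetical, and the constant `c` is left as a parameter because its value is not given here.

```python
import math

def pseudo_huber(x, y, c):
    # d(x, y) = sqrt(||x - y||_2^2 + c^2) - c
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sq + c * c) - c

def ict_weight(sigma_i, sigma_next):
    # lambda(sigma_i) = 1 / (sigma_{i+1} - sigma_i):
    # closely spaced noise levels receive a larger weight.
    return 1.0 / (sigma_next - sigma_i)
```

For small differences the distance behaves like a scaled squared error, and for large differences it grows linearly, which is where the robustness to outliers comes from.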

1.6. Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models (sCM)

Figure 3: Discrete-time CMs (top & middle) vs. continuous-time CMs (bottom). Discrete-time CMs suffer from discretization errors from numerical ODE solvers, causing imprecise predictions during training. In contrast, continuous-time CMs stay on the ODE trajectory by following its tangent direction with infinitesimal steps.

Discrete-time CMs: The training objective is defined at two adjacent time steps separated by a finite distance:

$𝐸_{𝒙_{𝑡}, 𝑡} [𝑤 (𝑡) 𝑑 (𝑓_{𝜃} (𝒙_{𝑡}, 𝑡), 𝑓_{𝜃^{-}} (𝒙_{𝑡 - Δ 𝑡}, 𝑡 - Δ 𝑡))]$

where $𝜃^{-}$ denotes $stopgrad (𝜃)$ , $𝑤 (𝑡)$ is the weighting function, $Δ 𝑡 > 0$ is the distance between two adjacent time steps, and $𝑑 (\cdot, \cdot)$ is the distance metric.
Continuous-time CMs: With $𝑑 (𝒙, 𝒚) = {‖ 𝒙 - 𝒚 ‖}_{2}^{2}$ and $Δ 𝑡 \to 0$ , the gradient of the discrete-time objective with respect to $𝜃$ , rescaled by $1 / Δ 𝑡$ , converges up to a constant factor to

$\nabla_{𝜃} 𝐸_{𝒙_{𝑡}, 𝑡} [𝑤 (𝑡) 𝑓_{𝜃}^{\top} (𝒙_{𝑡}, 𝑡) \frac{𝑑 𝑓_{𝜃^{-}} (𝒙_{𝑡}, 𝑡)}{𝑑 𝑡}]$

where $\frac{𝑑 𝑓_{𝜃^{-}} (𝒙_{𝑡}, 𝑡)}{𝑑 𝑡} = \nabla_{𝒙_{𝑡}} 𝑓_{𝜃^{-}} (𝒙_{𝑡}, 𝑡) \frac{𝑑 𝒙_{𝑡}}{𝑑 𝑡} + \partial_{𝑡} 𝑓_{𝜃^{-}} (𝒙_{𝑡}, 𝑡)$ is the tangent of $𝑓_{𝜃^{-}}$ at $(𝒙_{𝑡}, 𝑡)$ along the trajectory of the PF-ODE $\frac{𝑑 𝒙}{𝑑 𝑡}$ .

Notably, continuous-time CMs do not rely on ODE solvers, which avoids discretization errors and offers more accurate supervision signals during training. However, previous work found that training continuous-time CMs, or even discrete-time CMs with an extremely small $Δ 𝑡$ , suffers from severe instability in optimization. This greatly limits the empirical performance and adoption of continuous-time CMs.

To address the stability issues of continuous-time consistency models, researchers introduced TrigFlow, a formulation that unifies EDM (Elucidated Diffusion Models) and Flow Matching while simplifying the mathematics.

TrigFlow uses trigonometric functions to parameterize the diffusion process, making the mathematical expressions much cleaner and more stable. The key insight is to replace the complex EDM coefficients with simple trigonometric functions:

Diffusion Process:

𝒙_{𝑡} = \cos (𝑡) 𝒙_{0} + \sin (𝑡) 𝒛

where $𝑡 \in [0, \frac{𝜋}{2}]$ and $𝒛 \sim 𝒩 (𝟎, 𝜎_{𝑑}^{2} 𝑰)$ .

Diffusion Models and PF-ODE:

𝑓_{𝜃}^{DM} (𝒙_{𝑡}, 𝑡) = 𝐹_{𝜃} (\frac{𝒙_{𝑡}}{𝜎_{𝑑}}, 𝑐_{noise} (𝑡))

where $𝐹_{𝜃}$ is a neural network with parameters $𝜃$ , and $𝑐_{noise} (𝑡)$ is a transformation of $𝑡$ to facilitate time conditioning. The corresponding PF-ODE is given by

\frac{𝑑 𝒙_{𝑡}}{𝑑 𝑡} = 𝜎_{𝑑} 𝐹_{𝜃} (\frac{𝒙_{𝑡}}{𝜎_{𝑑}}, 𝑐_{noise} (𝑡))

Training Target:

ℒ_{Diff} (𝜃) = 𝐸_{𝒙_{0}, 𝒛, 𝑡} [{‖ 𝜎_{𝑑} 𝐹_{𝜃} (\frac{𝒙_{𝑡}}{𝜎_{𝑑}}, 𝑐_{noise} (𝑡)) - 𝒗_{𝑡} ‖}_{2}^{2}]

𝒗_{𝑡} = \cos (𝑡) 𝒛 - \sin (𝑡) 𝒙_{0}

Consistency Model Parameterization:

𝑐_{skip} (𝑡) = \cos (𝑡), \quad 𝑐_{out} (𝑡) = - 𝜎_{𝑑} \sin (𝑡), \quad 𝑐_{in} (𝑡) = \frac{1}{𝜎_{𝑑}}

𝑓_{𝜃} (𝒙_{𝑡}, 𝑡) = \cos (𝑡) 𝒙_{𝑡} - \sin (𝑡) 𝜎_{𝑑} 𝐹_{𝜃} (\frac{𝒙_{𝑡}}{𝜎_{𝑑}}, 𝑐_{noise} (𝑡))
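A short numerical check shows why this parameterization points at the clean sample. The sketch substitutes the per-sample velocity target for the scaled network output, which is an idealization: a trained network only approximates the conditional expectation of that target. The function names are hypothetical.

```python
import math

def trigflow_point(x0, z, t):
    # x_t = cos(t) x0 + sin(t) z
    return [math.cos(t) * a + math.sin(t) * b for a, b in zip(x0, z)]

def trigflow_velocity(x0, z, t):
    # v_t = cos(t) z - sin(t) x0, the time derivative of x_t
    return [math.cos(t) * b - math.sin(t) * a for a, b in zip(x0, z)]

def consistency_from_velocity(x_t, v_t, t):
    # cos(t) x_t - sin(t) v_t, i.e. the CM parameterization with
    # sigma_d * F replaced by the velocity target v_t.
    return [math.cos(t) * a - math.sin(t) * b for a, b in zip(x_t, v_t)]
```

Expanding the last expression gives (cos² t + sin² t) times the clean sample, with the noise terms cancelling, so the output equals the clean sample for every t in the interval.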

1.6.1. Stabilization Techniques

The TrigFlow framework incorporates several key stabilization techniques:

Identity Time Transformation: Using $𝑐_{noise} (𝑡) = 𝑡$ instead of the complex logarithmic transformation from EDM prevents numerical blow-up as $𝑡 \to \frac{𝜋}{2}$ .
Positional Time Embeddings: Avoiding high-frequency Fourier embeddings in favor of positional embeddings reduces gradient instability.
Adaptive Double Normalization: Modifying the AdaGN layers to use pixel normalization for both scale and bias terms improves training stability.
Tangent Normalization: Explicitly normalizing the tangent function $\frac{𝑑 𝑓_{𝜃^{-}} (𝒙_{𝑡}, 𝑡)}{𝑑 𝑡}$ to control gradient variance.
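Of the techniques above, tangent normalization is the simplest to illustrate. The sketch below assumes the normalization divides the tangent by its norm plus a small constant; the default value of `c` is an assumption chosen for illustration.

```python
import math

def normalize_tangent(g, c=0.1):
    # g / (||g||_2 + c); c (assumed value) keeps the division well defined
    # when the tangent is close to zero.
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / (norm + c) for gi in g]
```

The result always has norm below 1, so a single sample with a very large tangent can no longer dominate the gradient.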

1.6.2. Training Objective

For stability, the training objective is modified to:

ℒ_{sCM} (𝜃, 𝜑) = 𝐸_{𝒙_{𝑡}, 𝑡} [\frac{𝑒^{𝑤_{𝜑} (𝑡)}}{𝐷} {‖ 𝐹_{𝜃} (\frac{𝒙_{𝑡}}{𝜎_{𝑑}}, 𝑡) - 𝐹_{𝜃^{-}} (\frac{𝒙_{𝑡}}{𝜎_{𝑑}}, 𝑡) - \cos (𝑡) \frac{𝑑 𝑓_{𝜃^{-}} (𝒙_{𝑡}, 𝑡)}{𝑑 𝑡} ‖}_{2}^{2} - 𝑤_{𝜑} (𝑡)]

where $𝑤_{𝜑} (𝑡)$ is an adaptive weighting function that balances the loss across different time steps.
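To see what the adaptive weight does, one can minimize the objective over the weight at a fixed time. The shorthand e(t) for the expected squared error divided by D is introduced here for illustration only and is not notation from the source.

```latex
% e(t): expected squared error at time t divided by D (shorthand, not from the source)
\begin{aligned}
J(w) &= e^{w}\, e(t) - w \\
\frac{\partial J}{\partial w} &= e^{w}\, e(t) - 1 = 0
  \;\Longrightarrow\; w^{*}(t) = -\log e(t) \\
e^{w^{*}(t)}\, e(t) &= 1
\end{aligned}
```

At this optimum every time step contributes a weighted error of 1, which is one way to read the statement that the weighting balances the loss across time steps.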

1.7. Consistency Trajectory Models

Consistency Trajectory Models (CTM) start from the PF ODE of a diffusion model with the forward process $𝒙_{𝑡} = 𝒙_{0} + 𝑡 𝒛$ :

\frac{𝑑 𝒙_{𝑡}}{𝑑 𝑡} = - 𝑡 \nabla_{𝒙_{𝑡}} \log 𝑝_{𝑡} (𝒙_{𝑡}) = \frac{𝒙_{𝑡} - 𝐸_{𝑝_{𝑡 0} (𝒙 | 𝒙_{𝑡})} [𝒙 | 𝒙_{𝑡}]}{𝑡}

where $𝑝_{𝑡 0} (𝒙 | 𝒙_{𝑡})$ is the probability distribution of the solution of the reverse-time stochastic process from time $𝑡$ to zero, initiated from $𝒙_{𝑡}$ . Here, $𝐸_{𝑝_{𝑡 0} (𝒙 | 𝒙_{𝑡})} [𝒙 | 𝒙_{𝑡}] = 𝒙_{𝑡} + 𝑡^{2} \nabla \log 𝑝_{𝑡} (𝒙_{𝑡})$ is the denoiser function (Tweedie's formula), an alternative expression for the score function $\nabla \log 𝑝_{𝑡} (𝒙_{𝑡})$ . In practice, the denoiser $𝐸_{𝑝_{𝑡 0} (𝒙 | 𝒙_{𝑡})} [𝒙 | 𝒙_{𝑡}]$ is approximated by a neural network $𝐷_{𝜑}$ , obtained by minimizing the DSM loss $𝐸_{𝒙_{0}, 𝑡, 𝑝_{0 𝑡} (𝒙_{𝑡} | 𝒙_{0})} [{‖ 𝒙_{0} - 𝐷_{𝜑} (𝒙_{𝑡}, 𝑡) ‖}_{2}^{2}]$ .

Sampling from a diffusion model (DM) involves solving the PF ODE, which is equivalent to computing the integral

\int_{𝑇}^{0} \frac{𝑑 𝒙_{𝑡}}{𝑑 𝑡} 𝑑 𝑡 = \int_{𝑇}^{0} \frac{𝒙_{𝑡} - 𝐷_{𝜑} (𝒙_{𝑡}, 𝑡)}{𝑡} 𝑑 𝑡 \Leftrightarrow 𝒙_{0} = 𝒙_{𝑇} + \int_{𝑇}^{0} \frac{𝒙_{𝑡} - 𝐷_{𝜑} (𝒙_{𝑡}, 𝑡)}{𝑡} 𝑑 𝑡

where $𝒙_{𝑇}$ is sampled from a prior distribution $𝜋$ approximating $𝑝_{𝑇}$ . Decoding strategies of DM primarily fall into two categories: score-based sampling with time-discretized numerical integral solvers, and distillation sampling where a neural network directly estimates the integral.

Score-based Sampling: Despite recent advancements in numerical solvers, further improvements may be challenging due to the inherent discretization error present in all solvers, ultimately limiting the sample quality obtained with few NFEs.
Distillation Sampling: Distillation models' multistep sampling approach exhibits degrading sample quality with increasing NFE, lacking a clear trade-off between computational budget (NFE) and sample fidelity. Furthermore, multistep sampling is not deterministic, leading to uncontrollable sample variance.

1.7.1. Trajectory Mapping Function

CTM learns a trajectory mapping function $𝐺_{𝜃} : (𝒙_{𝑡}, 𝑡, 𝑠) \mapsto 𝒙_{𝑠}$ that maps a point $𝒙_{𝑡}$ at time $𝑡$ to the corresponding point $𝒙_{𝑠}$ at time $𝑠$ along the same ODE trajectory.

Key properties:

Consistency: $𝐺_{𝜃} (𝒙_{𝑡}, 𝑡, 𝑡) = 𝒙_{𝑡}$ (identity mapping)
Transitivity: $𝐺_{𝜃} (𝒙_{𝑡}, 𝑡, 𝑠) = 𝐺_{𝜃} (𝐺_{𝜃} (𝒙_{𝑡}, 𝑡, 𝑟), 𝑟, 𝑠)$ for any intermediate time $𝑟$
Boundary condition: $𝐺_{𝜃} (𝒙_{0}, 0, 𝑠) = 𝒙_{𝑠}$ where $𝒙_{𝑠}$ follows the forward process

For stable training, $𝐺$ is expressed as a mixture of $𝒙_{𝑡}$ and a function $𝑔$ (inspired by the Euler solver):

𝐺 (𝒙_{𝑡}, 𝑡, 𝑠) = \frac{𝑠}{𝑡} 𝒙_{𝑡} + (1 - \frac{𝑠}{𝑡}) 𝑔 (𝒙_{𝑡}, 𝑡, 𝑠)

where $𝑔 (𝒙_{𝑡}, 𝑡, 𝑠) = 𝒙_{𝑡} + \frac{𝑡}{𝑡 - 𝑠} \int_{𝑡}^{𝑠} \frac{𝒙_{𝑢} - 𝐸 [𝒙 | 𝒙_{𝑢}]}{𝑢} 𝑑 𝑢$ , and $𝑔$ is approximated by a neural network $𝑔_{𝜃}$ . A critical insight is that this parameterization also gives access to the score function. Taking the limit as $𝑠$ approaches $𝑡$ gives:

\lim_{𝑠 \to 𝑡} 𝑔 (𝒙_{𝑡}, 𝑡, 𝑠) = 𝐸 [𝒙 | 𝒙_{𝑡}]

This means that CTM not only learns to make long jumps along the trajectory but also learns the infinitesimal jumps, i.e., the denoiser/score function. This property unifies score-based and distillation approaches within a single model.
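The limit can be checked numerically on a toy problem. The sketch assumes 1-D Gaussian data with an assumed standard deviation `SIGMA_DATA`, for which both the exact trajectory jump and the denoiser have closed forms; all function names are hypothetical.

```python
import math

SIGMA_DATA = 0.5  # assumed std of the toy 1-D Gaussian data

def r(u):
    # Standard deviation of p_u for the toy problem.
    return math.sqrt(SIGMA_DATA**2 + u**2)

def G_exact(x_t, t, s):
    # Exact PF-ODE jump from time t to time s for p_data = N(0, SIGMA_DATA^2).
    return x_t * r(s) / r(t)

def g_from_G(x_t, t, s):
    # Invert G = (s/t) x_t + (1 - s/t) g for g (requires s != t).
    return (G_exact(x_t, t, s) - (s / t) * x_t) / (1.0 - s / t)

def denoiser(x_t, t):
    # Posterior mean E[x | x_t] for the same Gaussian toy problem.
    return x_t * SIGMA_DATA**2 / (SIGMA_DATA**2 + t**2)
```

Evaluating `g_from_G` with `s` just below `t` agrees with `denoiser` up to a small error, and chaining two calls of `G_exact` through an intermediate time gives the same result as one direct jump.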

1.7.2. CTM Training

CTM training combines a distillation loss with powerful auxiliary losses that provide direct training signals from the data, allowing the student (CTM) to surpass the teacher (diffusion model).

Soft Consistency Loss: The primary loss is a distillation loss where the CTM's jump prediction $𝐺_{𝜃} (𝒙_{𝑡}, 𝑡, 𝑠)$ is matched against a target generated by a pre-trained teacher model. To make this efficient and effective, CTM uses soft consistency matching. The model is trained to enforce $𝐺_{𝜃} (𝒙_{𝑡}, 𝑡, 𝑠) \approx 𝐺_{sg (𝜃)} (Solver (𝒙_{𝑡}, 𝑡, 𝑢; 𝜑), 𝑢, 𝑠)$ , where $𝑢$ is a random time between $𝑠$ and $𝑡$ . This serves as a flexible interpolation between local consistency ( $𝑢 = 𝑡 - Δ 𝑡$ , distilling a single step) and global consistency ( $𝑢 = 𝑠$ , distilling the entire interval).
Auxiliary Losses:
1. Denoising Score Matching (DSM) Loss: Since CTM can estimate the denoiser via $𝑔_{𝜃} (𝒙_{𝑡}, 𝑡, 𝑡)$ , it is explicitly trained with a DSM loss: $ℒ_{DSM} (𝜃) = 𝐸 [{‖ 𝒙_{0} - 𝑔_{𝜃} (𝒙_{𝑡}, 𝑡, 𝑡) ‖}_{2}^{2}]$ . This regularizes the model to accurately learn the score function, which is vital for precision and for enabling score-based sampling methods.
2. Adversarial (GAN) Loss: To further enhance sample quality and refine details, an adversarial loss $ℒ_{GAN}$ is added, similar to VQGAN.

The final loss is a weighted sum: $ℒ ≔ ℒ_{CTM} + 𝜆_{DSM} ℒ_{DSM} + 𝜆_{GAN} ℒ_{GAN}$ .

1.7.3. $𝛾$ -Sampling

CTM's ability to travel between any two points in time enables a novel and flexible sampling scheme called $𝛾$ -sampling, controlled by a parameter $𝛾 \in [0, 1]$ . A sampling step from $𝑡_{𝑛}$ to $𝑡_{𝑛 + 1}$ involves:

Denoise: Jump from $𝒙_{𝑡_{𝑛}}$ to an intermediate time $\sqrt{1 - 𝛾^{2}} 𝑡_{𝑛 + 1}$ using the learned function $𝐺_{𝜃}$ .
Noisify: Add a controlled amount of noise to reach the noise level corresponding to time $𝑡_{𝑛 + 1}$ .

This process offers a spectrum of sampling behaviors:

$𝛾 = 0$ (Deterministic): The process is fully deterministic. It follows the PF ODE path directly, avoiding the discretization errors of traditional ODE solvers and the error accumulation of CM's multistep method. Sample quality consistently improves with more NFEs.
$𝛾 = 1$ (Fully Stochastic): This recovers the multistep sampling method used in Consistency Models. However, this approach suffers from error accumulation, and sample quality can degrade as NFE increases.
$0 < 𝛾 < 1$ (Hybrid): This provides a tuneable level of stochasticity, generalizing the stochastic samplers found in models like EDM.
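The sampling loop can be sketched with a toy trajectory function. The assumptions are 1-D Gaussian data with an assumed standard deviation `SIGMA_DATA`, so the exact jump function is known, and a noisify step that adds Gaussian noise of scale gamma times the next time, which brings the total noise level back to that time. The time grid is illustrative.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed std of the toy 1-D Gaussian data

def G_toy(x_t, t, s):
    # Exact trajectory jump from t to s for p_data = N(0, SIGMA_DATA^2).
    return (x_t * math.sqrt(SIGMA_DATA**2 + s**2)
            / math.sqrt(SIGMA_DATA**2 + t**2))

def gamma_sample(G, x, times, gamma, rng):
    # times: decreasing noise levels ending at 0, e.g. [80.0, 10.0, 1.0, 0.0]
    for t_cur, t_next in zip(times[:-1], times[1:]):
        s = math.sqrt(1.0 - gamma**2) * t_next         # denoise target time
        x = G(x, t_cur, s)                              # jump along the trajectory
        x = x + gamma * t_next * rng.gauss(0.0, 1.0)    # noisify up to t_next
    return x
```

With `gamma = 0.0` the noise term vanishes and the loop reduces to a chain of deterministic jumps; with `gamma = 1.0` every jump goes all the way to time zero before noise is added again.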

1.8. Flow-Anchored Consistency Model (FACM)

Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: by training a network to learn only a shortcut across a probability flow, the model loses its grasp on the instantaneous velocity field that defines the flow.

1.8.1. The Source of Instability: Losing the Flow Anchor

The practical implementation of the continuous-time CM objective via the training target $𝑇 = 𝑣 + (1 - 𝑡) \frac{𝑑 𝐹_{𝜃^{-}} (𝒙_{𝑡}, 𝑡)}{𝑑 𝑡}$ is notoriously unstable. The core of this instability lies in the target's self-referential nature, creating two fundamental, intertwined problems:

Missing Instantaneous Velocity Field Supervision: The target $𝑇$ explicitly depends on the ground-truth instantaneous velocity $𝑣$ . The CM objective, however, only enforces a loss on the final prediction $𝐹_{𝜃}$ (the average velocity). There is no explicit mechanism to ensure that the model's learned dynamics remains faithful to the underlying instantaneous velocity field $𝑣$ .
Self-Referential Derivative Estimation: The network is required to estimate its own derivative. Even if the network is pre-trained and initially provides a good approximation of the instantaneous velocity, the CM objective alone provides no continuous supervision to maintain this alignment.

1.8.2. Flow-Anchoring Principle

Stability can be achieved by explicitly anchoring the model in the very flow it is shortcutting. The most direct way to achieve this Flow-Anchoring is to re-introduce the explicit training of the instantaneous velocity field that defines the flow.

1.8.3. FACM Training Strategy

FACM employs a simple yet effective training strategy that mixes two complementary objectives:

ℒ_{FACM} = ℒ_{FM} + ℒ_{CM}

1.8.3.1. Flow Matching Loss (The Anchor)

This loss component anchors the model by regressing its output towards the instantaneous velocity $𝑣$ :

ℒ_{FM} (𝜃) = 𝐸 [{‖ 𝐹_{𝜃} (𝒙_{𝑡}, 𝑐_{FM}) - 𝑣 ‖}_{2}^{2} + ℒ_{cos} (𝐹_{𝜃} (𝒙_{𝑡}, 𝑐_{FM}), 𝑣)]

where $ℒ_{cos} (𝑎, 𝑏) = 1 - \frac{𝑎 \cdot 𝑏}{{‖ 𝑎 ‖}_{2} {‖ 𝑏 ‖}_{2}}$

1.8.3.2. Consistency Model Loss (The Accelerator)

This component acts as an accelerator, training the model to learn the generative shortcut. We interpret the consistency condition as a fixed-point problem:

𝐹_{𝜃} = 𝑇 (𝐹_{𝜃}), \quad \text{where } 𝑇 (𝐹) = 𝑣 + (1 - 𝑡) \frac{𝑑 𝐹}{𝑑 𝑡}

First, we compute the consistency residual $𝑔$ of the stop-gradient model $𝐹_{𝜃^{-}}$ :

𝑔 = 𝐹_{𝜃^{-}} (𝒙_{𝑡}, 𝑐_{CM}) - 𝑇 (𝐹_{𝜃^{-}})

Then form a perturbed target:

𝑣_{tar} = 𝐹_{𝜃^{-}} (𝒙_{𝑡}, 𝑐_{CM}) - 𝛼 (𝑡) \cdot 𝑔 = (1 - 𝛼 (𝑡)) 𝐹_{𝜃^{-}} (𝒙_{𝑡}, 𝑐_{CM}) + 𝛼 (𝑡) 𝑇 (𝐹_{𝜃^{-}})

This formulation provides a stable, interpolated learning target between the current model's output and the ideal consistency target. The final CM loss component uses a norm L2 loss, $𝐿_{norm}$ , and is modulated by weighting functions $𝛼 (𝑡)$ and $𝛽 (𝑡)$ :

ℒ_{CM} (𝜃) = 𝐸 [𝛽 (𝑡) \cdot 𝐿_{norm} (𝐹_{𝜃} (𝑥_{𝑡}, 𝑐_{CM}), 𝑣_{tar})]

It is important to note that our specific choices for weighting and loss functions are designed to accelerate convergence, not as prerequisites for stability, which is guaranteed by the Flow-Anchoring principle.
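The building blocks of the two loss components can be sketched as small helper functions. The names are hypothetical, the time derivative of the model output is assumed to be supplied by the caller (for example from a Jacobian-vector product), and the weighting functions and the norm L2 loss are left out because their exact forms are not given here.

```python
import math

def cosine_loss(a, b):
    # L_cos(a, b) = 1 - (a . b) / (||a||_2 ||b||_2), used in the FM anchor term
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return 1.0 - dot / (na * nb)

def consistency_target(v, dF_dt, t):
    # T(F) = v + (1 - t) * dF/dt
    return [vi + (1.0 - t) * di for vi, di in zip(v, dF_dt)]

def perturbed_target(F_stop, target, alpha):
    # v_tar = (1 - alpha) * F + alpha * T(F)
    return [(1.0 - alpha) * fi + alpha * ti for fi, ti in zip(F_stop, target)]
```

Setting `alpha` to 0 makes the target equal to the current stop-gradient output, and setting it to 1 recovers the full consistency target, so `alpha` controls how far each update moves toward the fixed point.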

1.8.4. Implementation of the Mixed Objective

1.8.4.1. Expanded Time Interval Strategy

An expanded time domain can be used to distinguish between the two tasks:

CM Task: Operates on the interval $𝑡 \in [0, 1]$ , using $𝑐_{CM} = 𝑡$
FM Task: Maps to the alternate interval $[1, 2]$ by setting $𝑐_{FM} = 2 - 𝑡$

This mapping ensures continuity at the boundary $𝑡 = 1$ :

\lim_{𝑡 \to 1^{-}} [𝑣 + (1 - 𝑡) \frac{𝑑 𝐹_{𝜃} (𝒙_{𝑡}, 𝑡)}{𝑑 𝑡}] = 𝑣

1.8.4.2. Auxiliary Condition Strategy

Alternatively, we can introduce a second time variable $𝑟$ , making the model's full conditioning a tuple of $(𝑡, 𝑟)$ :

$𝑐_{CM} = (𝑡, 1)$ : Model learns CM task (average velocity from $𝑡$ to $1$ )
$𝑐_{FM} = (𝑡, 𝑡)$ : Model learns FM task (instantaneous velocity at $𝑡$ )
