Consistency Models

1. Consistency Models

While diffusion models achieve remarkable sample quality, they require hundreds or thousands of function evaluations during generation, making them computationally expensive for real-time applications.

The key insight behind Consistency Models is to learn a direct mapping from noise to data, enabling single-step generation while preserving the high-quality outputs characteristic of diffusion models.

1.1. The Core Intuition

The fundamental idea of Consistency Models stems from a simple yet powerful observation: if we can learn to map any point on a diffusion trajectory directly to its corresponding clean data point, we can bypass the iterative denoising process entirely.

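Consider the probability flow ODE that underlies diffusion models:

$$\mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2} g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right] \mathrm{d}t$$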

This ODE defines a deterministic trajectory from noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, I)$ to data $\mathbf{x}_0 \sim p_{\text{data}}$. A Consistency Model learns a function $f_\theta: (\mathbf{x}_t, t) \mapsto \mathbf{x}_0$ that maps any point $(\mathbf{x}_t, t)$ on this trajectory to the trajectory's origin $\mathbf{x}_0$.

1.1.1. Self-Consistency Property

The defining characteristic of Consistency Models is the self-consistency property:

Definition: A function $f: (\mathbf{x}, t) \mapsto \mathbf{x}$ satisfies the self-consistency property if:

$$f(\mathbf{x}_t, t) = f(\mathbf{x}_{t'}, t') \quad \forall\, \mathbf{x}_t, \mathbf{x}_{t'} \text{ on the same ODE trajectory}$$

In other words, the consistency function maps any two points on the same trajectory to the same output. Combined with the boundary condition at $t = 0$, this gives:

  1. $f(\mathbf{x}_0, 0) = \mathbf{x}_0$ (identity mapping at the boundary)
  2. $f(\mathbf{x}_t, t) = \mathbf{x}_0$ for all $t > 0$ (consistent prediction)

1.2. Mathematical Framework

1.2.1. Parameterization

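To ensure the boundary condition $f(\mathbf{x}_0, 0) = \mathbf{x}_0$, Consistency Models use a specific parameterization:

$$f_\theta(\mathbf{x}, t) = \begin{cases} \mathbf{x} & \text{if } t = 0 \\ F_\theta(\mathbf{x}, t) & \text{if } t > 0 \end{cases}$$

where $F_\theta$ is a deep neural network. In practice, this is implemented as:

$$f_\theta(\mathbf{x}, t) = c_{\text{skip}}(t)\,\mathbf{x} + c_{\text{out}}(t)\,F_\theta(\mathbf{x}, t)$$

with $c_{\text{skip}}(0) = 1$ and $c_{\text{out}}(0) = 0$, ensuring the boundary condition is satisfied.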
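
The skip/output parameterization above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the coefficient formulas are one EDM-style choice that satisfies $c_{\text{skip}}(0) = 1$ and $c_{\text{out}}(0) = 0$, and `SIGMA_DATA` and `toy_network` are hypothetical stand-ins introduced only for this example.

```python
import math

SIGMA_DATA = 0.5  # assumed data standard deviation (hypothetical value)


def c_skip(t: float) -> float:
    """Skip coefficient; equals 1 at t = 0."""
    return SIGMA_DATA**2 / (t**2 + SIGMA_DATA**2)


def c_out(t: float) -> float:
    """Output coefficient; equals 0 at t = 0."""
    return SIGMA_DATA * t / math.sqrt(SIGMA_DATA**2 + t**2)


def toy_network(x: float, t: float) -> float:
    """Hypothetical stand-in for F_theta; any function of (x, t) works here."""
    return math.tanh(x) + 0.1 * t


def consistency_fn(x: float, t: float) -> float:
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t)."""
    return c_skip(t) * x + c_out(t) * toy_network(x, t)
```

Because the boundary condition is built into the coefficients, `consistency_fn(x, 0.0)` returns `x` no matter what the network outputs, so no extra loss term is needed to enforce it.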

1.2.2. Discretized Training

For computational tractability, the continuous time interval $[0, T]$ is discretized into $N$ sub-intervals:

$$0 = t_0 < t_1 < t_2 < \ldots < t_{N-1} < t_N = T$$

The consistency function is trained to satisfy:

$$f_\theta(\mathbf{x}_{t_n}, t_n) = f_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1})$$

for consecutive timesteps on the same trajectory.
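
As a rough illustration of the discretization, the sketch below builds a grid $t_0 < \ldots < t_N$ with power-law spacing, which places more points at low noise levels. The spacing rule, the exponent `rho`, and the default `t_max` are assumptions made for this example; the text above only requires an increasing grid from $0$ to $T$.

```python
def time_grid(n_steps: int, t_max: float = 80.0, t_min: float = 0.0,
              rho: float = 7.0) -> list[float]:
    """Return n_steps + 1 increasing times from t_min to t_max.

    Points are spaced uniformly in t ** (1 / rho), so they cluster near t_min.
    """
    lo = t_min ** (1.0 / rho)
    hi = t_max ** (1.0 / rho)
    return [(lo + i / n_steps * (hi - lo)) ** rho for i in range(n_steps + 1)]
```

For instance, `time_grid(18)` yields 19 points, and consecutive pairs `(grid[n], grid[n + 1])` are the adjacent timesteps on which the consistency condition is enforced.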

1.3. Training Methods

There are two primary approaches to training Consistency Models, each with distinct advantages and trade-offs.

1.3.1. Consistency Distillation (CD)

Consistency Distillation leverages a pre-trained diffusion model (teacher) to generate training pairs for the consistency model (student).

Algorithm:

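  1. Sample $\mathbf{x} \sim p_{\text{data}}$ and $n \sim \text{Uniform}(1, N-1)$
  2. Generate a noisy sample: $\mathbf{x}_{t_{n+1}} = \mathbf{x} + t_{n+1}\mathbf{z}$, where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$
  3. Use the pre-trained score model to take one ODE solver step from $t_{n+1}$ to $t_n$: $\hat{\mathbf{x}}_{t_n}^{\varphi} = \mathbf{x}_{t_{n+1}} + (t_n - t_{n+1})\,\Phi(\mathbf{x}_{t_{n+1}}, t_{n+1}; \varphi)$
  4. Minimize the consistency loss:

$$\mathcal{L}_{\text{CD}}^{N}(\theta, \theta^-; \varphi) = \mathbb{E}\left[\lambda(t_n)\, d\left(f_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}(\hat{\mathbf{x}}_{t_n}^{\varphi}, t_n)\right)\right]$$

where:

  • $\Phi(\cdot, \cdot; \varphi)$ is the update function of a one-step ODE solver applied to the PF-ODE, and $\varphi$ denotes the parameters of the pre-trained score model
  • $\theta^-$ is an exponential moving average of $\theta$: $\theta^- \leftarrow \text{stopgrad}(\mu\theta^- + (1-\mu)\theta)$
  • $\lambda(t)$ is a positive weighting function, typically $\lambda(t) = 1$
  • $d(\cdot, \cdot)$ is a distance metric (e.g., $\ell_2$ or LPIPS (learned perceptual image patch similarity))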

1.3.2. Consistency Training (CT)

Consistency Training learns consistency models from scratch without relying on pre-trained diffusion models.

Algorithm:

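  1. Sample $\mathbf{x} \sim p_{\text{data}}$, $n \sim \text{Uniform}(1, N-1)$, and $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$
  2. Define adjacent points on the trajectory, using the same noise $\mathbf{z}$ for both:

    • $\mathbf{x}_{t_{n+1}} = \mathbf{x} + t_{n+1}\mathbf{z}$
    • $\mathbf{x}_{t_n} = \mathbf{x} + t_n\mathbf{z}$
  3. Minimize the consistency loss:

$$\mathcal{L}_{\text{CT}}^{N}(\theta, \theta^-) = \mathbb{E}\left[\lambda(t_n)\, d\left(f_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}(\mathbf{x}_{t_n}, t_n)\right)\right]$$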

The experiments in the original paper report the following:

  • The Heun ODE solver works better than the first-order Euler solver, since higher-order ODE solvers have smaller estimation errors for the same $N$.
  • Among the options for the distance function $d(\cdot, \cdot)$, LPIPS works better than the $\ell_1$ and $\ell_2$ distances.
  • A smaller $N$ leads to faster convergence but worse samples, whereas a larger $N$ leads to slower convergence but better samples upon convergence.
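
The two training procedures differ only in how the point at $t_n$ is obtained, which the following toy sketch makes explicit. Everything here is an assumption chosen so the example runs without a neural network: the data are 1-D Gaussian with standard deviation `SIGMA_DATA`, the exact score of that toy plays the role of the pre-trained score model, the ODE is taken in the form $\mathrm{d}x/\mathrm{d}t = -t\,\nabla \log p_t(x)$ with a single Euler step (not the Heun solver), the model `f` has one scalar parameter, $\lambda(t_n) = 1$, and $d$ is the squared distance.

```python
import random

SIGMA_DATA = 0.5  # assumed std of the toy data distribution N(0, SIGMA_DATA^2)


def score(x: float, t: float) -> float:
    """Exact score of p_t = N(0, SIGMA_DATA^2 + t^2); stands in for the teacher."""
    return -x / (SIGMA_DATA**2 + t**2)


def euler_phi(x: float, t: float) -> float:
    """Update function Phi of a one-step Euler solver for dx/dt = -t * score."""
    return -t * score(x, t)


def f(x: float, t: float, theta: float) -> float:
    """Toy consistency model (theta >= 0); f(x, 0) = x holds for every theta."""
    return x / (1.0 + theta * t * t) ** 0.5


def sq_dist(a: float, b: float) -> float:
    return (a - b) ** 2


def cd_loss_sample(theta, theta_ema, t_n, t_np1, rng):
    """One-sample CD loss: the point at t_n comes from a teacher solver step."""
    x = rng.gauss(0.0, SIGMA_DATA)
    z = rng.gauss(0.0, 1.0)
    x_np1 = x + t_np1 * z
    x_n_hat = x_np1 + (t_n - t_np1) * euler_phi(x_np1, t_np1)
    return sq_dist(f(x_np1, t_np1, theta), f(x_n_hat, t_n, theta_ema))


def ct_loss_sample(theta, theta_ema, t_n, t_np1, rng):
    """One-sample CT loss: the point at t_n reuses the same x and z, no teacher."""
    x = rng.gauss(0.0, SIGMA_DATA)
    z = rng.gauss(0.0, 1.0)
    x_np1 = x + t_np1 * z
    x_n = x + t_n * z
    return sq_dist(f(x_np1, t_np1, theta), f(x_n, t_n, theta_ema))


def mean_loss(loss_fn, theta, n=200, seed=0):
    """Monte Carlo average with target parameters equal to theta."""
    rng = random.Random(seed)
    return sum(loss_fn(theta, theta, 1.0, 1.01, rng) for _ in range(n)) / n
```

In this toy, `theta = 1 / SIGMA_DATA**2` makes `f` the exact consistency function, so `mean_loss(cd_loss_sample, 4.0)` is far smaller than the same loss at `theta = 0.0`; the remaining gap from zero is the Euler discretization error.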

1.3.3. Target Network Updates

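Both training methods employ a target network $f_{\theta^-}$ to stabilize training. The target parameters are updated via exponential moving average:

$$\theta^- \leftarrow \text{stopgrad}(\mu\theta^- + (1-\mu)\theta)$$

where $\mu \in [0, 1)$ is the decay rate. Because no gradient flows through the target branch, the student is regressed toward a slowly moving target instead of chasing its own current output, which helps keep the consistency loss away from degenerate solutions (where the model simply outputs the same value for all inputs).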
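
The update rule above amounts to one line per parameter. The sketch below assumes parameters are stored as flat lists of floats, which is a simplification for illustration; with plain floats there is no gradient graph, so the stop-gradient is implicit.

```python
def ema_update(target: list[float], online: list[float], mu: float) -> list[float]:
    """Return mu * target + (1 - mu) * online, element by element."""
    return [mu * t + (1.0 - mu) * o for t, o in zip(target, online)]
```

With `mu = 0.0` the target simply copies the online parameters, and values of `mu` close to 1 make the target change slowly.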

1.4. Sampling and Inference

1.4.1. Single-Step Sampling

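The most straightforward sampling approach is single-step generation:

$$\hat{\mathbf{x}}_\epsilon = f_\theta(\hat{\mathbf{x}}_T, T), \quad \text{where } \hat{\mathbf{x}}_T \sim \mathcal{N}(\mathbf{0}, T^2 I)$$

Here $\epsilon$ denotes the small positive time at which trajectories are stopped in practice, standing in for the idealized boundary $t = 0$.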

The key insight is that π‘“πœƒ has learned to map any point on the diffusion trajectory back to the clean data, so we can start from pure noise and get the final result immediately.

1.4.2. Multi-Step Sampling

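While single-step sampling is fast, multi-step sampling can improve quality by alternating denoising and noise injection: after the first prediction, the sample is re-noised to a lower noise level and passed through $f_\theta$ again, and this is repeated over a decreasing sequence of noise levels.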

This process allows trading computational cost for sample quality, similar to how diffusion models work but with far fewer steps.
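
Both sampling modes can be sketched as follows. The example assumes a 1-D Gaussian toy distribution for which the consistency function is known in closed form (`f_exact`), takes the boundary at $t = 0$, and uses hypothetical values for `SIGMA_DATA`, `T_MAX`, and the intermediate noise levels `taus`; a trained model would replace `f_exact`.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed std of the toy data distribution
T_MAX = 80.0      # assumed maximum noise level T


def f_exact(x: float, t: float) -> float:
    """Closed-form consistency function for data ~ N(0, SIGMA_DATA^2)."""
    return x * SIGMA_DATA / math.sqrt(SIGMA_DATA**2 + t**2)


def sample_single_step(f, rng, t_max=T_MAX):
    """Draw x_T ~ N(0, T^2) and map it to data with one evaluation of f."""
    x_T = rng.gauss(0.0, t_max)
    return f(x_T, t_max)


def sample_multistep(f, rng, taus, t_max=T_MAX):
    """Alternate noise injection and denoising over decreasing levels taus."""
    x = sample_single_step(f, rng, t_max)
    for tau in taus:
        x_tau = x + tau * rng.gauss(0.0, 1.0)  # re-noise to level tau
        x = f(x_tau, tau)                      # denoise back to data
    return x
```

A call such as `sample_multistep(f_exact, random.Random(0), taus=[10.0, 1.0])` uses three function evaluations in total; each extra entry in `taus` adds one more.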

1.5. Improved Consistency Training (iCT)

  1. Removing EMA for the Teacher Network: The authors identified a theoretical flaw where using Exponential Moving Average (EMA) for the teacher network provides no useful training signal for the data distribution. By setting the EMA decay rate to zero ($\mu(k) = 0$), they ensure the teacher and student parameters are correctly aligned, which significantly boosts performance.
  2. Pseudo-Huber Loss: To replace the biased and computationally expensive LPIPS metric, the authors adopt the Pseudo-Huber loss, defined as $d(x, y) = \sqrt{\|x - y\|_2^2 + c^2} - c$. This simple metric is robust to outliers, reduces training variance, and ultimately surpasses the performance of LPIPS-based training.
  3. Improved Discretization Curriculum: The paper demonstrates that model performance scales with the number of discretization steps ($N$) according to a power law. Based on this, they propose a new exponential curriculum for $N(k)$, doubling $N$ at fixed intervals, which empirically yields the best sample quality.
  4. Lognormal Noise Schedule: The default training procedure over-emphasizes high noise levels. The authors introduce a lognormal noise schedule to focus training on more critical low-to-mid noise ranges, leading to a notable improvement in sample quality.
  5. Optimized Hyperparameters: The work also introduces an improved weighting function ($\lambda(\sigma_i) = \frac{1}{\sigma_{i+1} - \sigma_i}$), increased dropout rates, and fine-tuned noise embeddings to further enhance performance.
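
The Pseudo-Huber distance and the weighting function from the list above are simple enough to write out directly. In this sketch, vectors are plain lists of floats and the function names are chosen for this example only.

```python
import math


def pseudo_huber(x: list[float], y: list[float], c: float) -> float:
    """d(x, y) = sqrt(||x - y||_2^2 + c^2) - c."""
    sq_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(sq_norm + c * c) - c


def ict_weight(sigma_i: float, sigma_next: float) -> float:
    """lambda(sigma_i) = 1 / (sigma_{i+1} - sigma_i), assuming sigma_next > sigma_i."""
    return 1.0 / (sigma_next - sigma_i)
```

For small differences the distance behaves like a scaled squared error, and for large differences it grows roughly linearly in $\|x - y\|_2$, which is where the robustness to outliers comes from.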

1.6. Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models (sCM)

Figure 3: Discrete-time CMs (top & middle) vs. continuous-time CMs (bottom). Discrete-time CMs suffer from discretization errors from numerical ODE solvers, causing imprecise predictions during training. In contrast, continuous-time CMs stay on the ODE trajectory by following its tangent direction with infinitesimal steps.

  • Discrete-time CMs: The training objective is defined at two adjacent time steps with finite distance:

    $$\mathbb{E}_{\mathbf{x}_t, t}\left[w(t)\, d\left(f_\theta(\mathbf{x}_t, t),\, f_{\theta^-}(\mathbf{x}_{t-\Delta t}, t - \Delta t)\right)\right]$$

    where $\theta^-$ denotes $\text{stopgrad}(\theta)$, $w(t)$ is the weighting function, $\Delta t > 0$ is the distance between two adjacent time steps, and $d(\cdot, \cdot)$ is the distance metric.

  • Continuous-time CMs: Using $d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2$ and taking $\Delta t \to 0$, the gradient of the discrete-time objective with respect to $\theta$ converges, after rescaling by $1/\Delta t$ and up to a constant factor, to

    $$\nabla_\theta\, \mathbb{E}_{\mathbf{x}_t, t}\left[w(t)\, f_\theta^\top(\mathbf{x}_t, t)\, \frac{\mathrm{d}f_{\theta^-}(\mathbf{x}_t, t)}{\mathrm{d}t}\right]$$

    where $\frac{\mathrm{d}f_{\theta^-}(\mathbf{x}_t, t)}{\mathrm{d}t} = \nabla_{\mathbf{x}_t} f_{\theta^-}(\mathbf{x}_t, t)\,\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} + \partial_t f_{\theta^-}(\mathbf{x}_t, t)$ is the tangent of $f_{\theta^-}$ at $(\mathbf{x}_t, t)$ along the trajectory of the PF-ODE $\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t}$.

    Notably, continuous-time CMs do not rely on ODE solvers, which avoids discretization errors and offers more accurate supervision signals during training. However, previous work found that training continuous-time CMs, or even discrete-time CMs with an extremely small $\Delta t$, suffers from severe instability in optimization. This greatly limits the empirical performance and adoption of continuous-time CMs.
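
To make the tangent concrete, the sketch below evaluates $\nabla_x f \cdot \frac{\mathrm{d}x}{\mathrm{d}t} + \partial_t f$ with central finite differences on a 1-D toy. The Gaussian data distribution, the noising $x_t = x_0 + t z$, and the step size `h` are assumptions for this example; practical implementations typically obtain the same quantity with forward-mode automatic differentiation instead of finite differences.

```python
import math

SIGMA_DATA = 0.5  # assumed std of the toy data distribution


def pf_ode_velocity(x: float, t: float) -> float:
    """dx/dt of the PF-ODE for data ~ N(0, SIGMA_DATA^2) and x_t = x_0 + t z."""
    return t * x / (SIGMA_DATA**2 + t**2)


def tangent(f, x: float, t: float, h: float = 1e-5) -> float:
    """Total derivative df/dt along the PF-ODE, via central differences."""
    df_dx = (f(x + h, t) - f(x - h, t)) / (2.0 * h)
    df_dt = (f(x, t + h) - f(x, t - h)) / (2.0 * h)
    return df_dx * pf_ode_velocity(x, t) + df_dt


def f_exact(x: float, t: float) -> float:
    """Closed-form consistency function of this toy (constant on trajectories)."""
    return x * SIGMA_DATA / math.sqrt(SIGMA_DATA**2 + t**2)


def f_identity(x: float, t: float) -> float:
    """A function that violates self-consistency for t > 0."""
    return x
```

For `f_exact` the tangent is zero up to numerical error, since an exact consistency function does not change along a trajectory; for `f_identity` it equals the ODE velocity itself. The continuous-time objective drives this tangent toward zero.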

To address the stability issues of continuous-time consistency models, researchers introduced TrigFlow - a simplified theoretical framework that unifies EDM (Elucidated Diffusion Models) and Flow Matching while significantly simplifying the mathematical formulation.

TrigFlow uses trigonometric functions to parameterize the diffusion process, making the mathematical expressions much cleaner and more stable. The key insight is to replace the complex EDM coefficients with simple trigonometric functions:

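Diffusion Process:

$$\mathbf{x}_t = \cos(t)\,\mathbf{x}_0 + \sin(t)\,\mathbf{z}$$

where $t \in [0, \frac{\pi}{2}]$, $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \sigma_d^2 \mathbf{I})$, and $\sigma_d$ is the standard deviation of the data.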

Diffusion Models and PF-ODE:

$$f_\theta^{\text{DM}}(\mathbf{x}_t, t) = F_\theta\left(\frac{\mathbf{x}_t}{\sigma_d}, c_{\text{noise}}(t)\right)$$

where $F_\theta$ is a neural network with parameters $\theta$, and $c_{\text{noise}}(t)$ is a transformation of $t$ to facilitate time conditioning. The corresponding PF-ODE is given by

$$\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} = \sigma_d F_\theta\left(\frac{\mathbf{x}_t}{\sigma_d}, c_{\text{noise}}(t)\right)$$

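Training Target:

$$\mathcal{L}_{\text{Diff}}(\theta) = \mathbb{E}_{\mathbf{x}_0, \mathbf{z}, t}\left[\left\|\sigma_d F_\theta\left(\frac{\mathbf{x}_t}{\sigma_d}, c_{\text{noise}}(t)\right) - \mathbf{v}_t\right\|_2^2\right], \quad \mathbf{v}_t = \cos(t)\,\mathbf{z} - \sin(t)\,\mathbf{x}_0$$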

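Consistency Model Parameterization:

$$c_{\text{skip}}(t) = \cos(t), \quad c_{\text{out}}(t) = -\sigma_d \sin(t), \quad c_{\text{in}}(t) = \frac{1}{\sigma_d}$$

$$f_\theta(\mathbf{x}_t, t) = \cos(t)\,\mathbf{x}_t - \sin(t)\,\sigma_d F_\theta\left(\frac{\mathbf{x}_t}{\sigma_d}, c_{\text{noise}}(t)\right)$$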

1.6.1. Stabilization Techniques

The TrigFlow framework incorporates several key stabilization techniques:

  • Identity Time Transformation: Using $c_{\text{noise}}(t) = t$ instead of the complex logarithmic transformation from EDM prevents numerical blow-up as $t \to \frac{\pi}{2}$.
  • Positional Time Embeddings: Avoiding high-frequency Fourier embeddings in favor of positional embeddings reduces gradient instability.
  • Adaptive Double Normalization: Modifying the AdaGN layers to use pixel normalization for both scale and bias terms improves training stability.
  • Tangent Normalization: Explicitly normalizing the tangent function $\frac{\mathrm{d}f_{\theta^-}(\mathbf{x}_t, t)}{\mathrm{d}t}$ to control gradient variance.

1.6.2. Training Objective

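To stabilize training, the objective is modified to:

$$\mathcal{L}_{\text{sCM}}(\theta, \varphi) = \mathbb{E}_{\mathbf{x}_t, t}\left[\frac{e^{w_\varphi(t)}}{D}\left\|F_\theta\left(\frac{\mathbf{x}_t}{\sigma_d}, t\right) - F_{\theta^-}\left(\frac{\mathbf{x}_t}{\sigma_d}, t\right) - \cos(t)\,\frac{\mathrm{d}f_{\theta^-}(\mathbf{x}_t, t)}{\mathrm{d}t}\right\|_2^2 - w_\varphi(t)\right]$$

where $w_\varphi(t)$ is an adaptive weighting function that balances the loss across different time steps and $D$ is the dimensionality of the data.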
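
The TrigFlow definitions used in this section fit together in a way that is easy to check numerically: if the scaled network output $\sigma_d F_\theta$ equals the velocity target $\mathbf{v}_t$ exactly, the consistency parameterization returns $\mathbf{x}_0$. The sketch below verifies this on scalars; the argument name `sigma_d_times_F` is a placeholder for the scaled network output and is not part of any real API.

```python
import math


def trigflow_point(x0: float, z: float, t: float) -> float:
    """x_t = cos(t) * x0 + sin(t) * z."""
    return math.cos(t) * x0 + math.sin(t) * z


def trigflow_velocity(x0: float, z: float, t: float) -> float:
    """v_t = cos(t) * z - sin(t) * x0, the time derivative of x_t."""
    return math.cos(t) * z - math.sin(t) * x0


def consistency_output(x_t: float, t: float, sigma_d_times_F: float) -> float:
    """f_theta(x_t, t) = cos(t) * x_t - sin(t) * sigma_d * F_theta(...)."""
    return math.cos(t) * x_t - math.sin(t) * sigma_d_times_F
```

Substituting the two definitions gives $\cos^2(t)\,\mathbf{x}_0 + \sin^2(t)\,\mathbf{x}_0 = \mathbf{x}_0$, and at $t = 0$ the output reduces to $\mathbf{x}_t$ regardless of the network, which is the boundary condition.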

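1.7. Consistency Trajectory Models

Consistency Trajectory Models (CTM) start from the PF ODE written in terms of the score function:

$$\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} = -t\,\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) = \frac{\mathbf{x}_t - \mathbb{E}_{p_{t0}(\mathbf{x}|\mathbf{x}_t)}[\mathbf{x}|\mathbf{x}_t]}{t}$$

where $p_{t0}(\mathbf{x}|\mathbf{x}_t)$ is the probability distribution of the solution of the reverse-time stochastic process from time $t$ to zero, initiated from $\mathbf{x}_t$. Here, $\mathbb{E}_{p_{t0}(\mathbf{x}|\mathbf{x}_t)}[\mathbf{x}|\mathbf{x}_t] = \mathbf{x}_t + t^2\,\nabla \log p_t(\mathbf{x}_t)$ is the denoiser function (Tweedie's formula), an alternative expression for the score function $\nabla \log p_t(\mathbf{x}_t)$. In practice, the denoiser $\mathbb{E}_{p_{t0}(\mathbf{x}|\mathbf{x}_t)}[\mathbf{x}|\mathbf{x}_t]$ is approximated using a neural network $D_\varphi$, obtained by minimizing the DSM loss $\mathbb{E}_{\mathbf{x}_0, t, p_{0t}(\mathbf{x}_t|\mathbf{x}_0)}\left[\|\mathbf{x}_0 - D_\varphi(\mathbf{x}_t, t)\|_2^2\right]$.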

Sampling from a diffusion model (DM) involves solving the PF ODE, equivalent to computing the integral

$$\int_T^0 \frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t}\,\mathrm{d}t = \int_T^0 \frac{\mathbf{x}_t - D_\varphi(\mathbf{x}_t, t)}{t}\,\mathrm{d}t \iff \mathbf{x}_0 = \mathbf{x}_T + \int_T^0 \frac{\mathbf{x}_t - D_\varphi(\mathbf{x}_t, t)}{t}\,\mathrm{d}t$$

where $\mathbf{x}_T$ is sampled from a prior distribution $\pi$ approximating $p_T$. Decoding strategies of DM primarily fall into two categories: score-based sampling with time-discretized numerical integral solvers, and distillation sampling where a neural network directly estimates the integral.

  • Score-based Sampling: Despite recent advancements in numerical solvers, further improvements may be challenging due to the inherent discretization error present in all solvers, ultimately limiting the sample quality obtained with few NFEs.
  • Distillation Sampling: Distillation models' multistep sampling approach exhibits degrading sample quality with increasing NFE, lacking a clear trade-off between computational budget (NFE) and sample fidelity. Furthermore, multistep sampling is not deterministic, leading to uncontrollable sample variance.

1.7.1. Trajectory Mapping Function

CTM learns a trajectory mapping function $G_\theta: (\mathbf{x}_t, t, s) \mapsto \mathbf{x}_s$ that maps a point $\mathbf{x}_t$ at time $t$ to the corresponding point $\mathbf{x}_s$ at time $s$ along the same ODE trajectory.

Key properties:

  1. Consistency: $G_\theta(\mathbf{x}_t, t, t) = \mathbf{x}_t$ (identity mapping)
  2. Transitivity: $G_\theta(\mathbf{x}_t, t, s) = G_\theta(G_\theta(\mathbf{x}_t, t, r), r, s)$ for any intermediate time $r$
  3. Boundary condition: $G_\theta(\mathbf{x}_0, 0, s) = \mathbf{x}_s$ where $\mathbf{x}_s$ follows the forward process

For stable training, we express $G$ as a mixture of $\mathbf{x}_t$ and a function $g$ (inspired by the Euler solver):

$$G(\mathbf{x}_t, t, s) = \frac{s}{t}\,\mathbf{x}_t + \left(1 - \frac{s}{t}\right) g(\mathbf{x}_t, t, s)$$

where $g(\mathbf{x}_t, t, s) = \mathbf{x}_t + \frac{t}{t - s}\int_t^s \frac{\mathbf{x}_u - \mathbb{E}[\mathbf{x}|\mathbf{x}_u]}{u}\,\mathrm{d}u$, and we approximate $g$ by a neural network $g_\theta$. A critical insight is that this parameterization also allows access to the score function. By taking the limit as $s$ approaches $t$, we find:

$$\lim_{s \to t} g(\mathbf{x}_t, t, s) = \mathbb{E}[\mathbf{x}|\mathbf{x}_t]$$

This means that CTM not only learns to make long jumps along the trajectory but also learns the infinitesimal jumps, i.e., the denoiser/score function. This property unifies score-based and distillation approaches within a single model.
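
These properties can be checked on a toy problem where the trajectory map is known exactly. The sketch assumes 1-D Gaussian data with standard deviation `SIGMA_DATA` and the noising $x_t = x_0 + t z$, for which the PF ODE has a closed-form solution; `G_exact`, `g_from_G`, `G_from_g`, and `denoiser` are names introduced for this example only.

```python
import math

SIGMA_DATA = 0.5  # assumed std of the toy data distribution


def G_exact(x_t: float, t: float, s: float) -> float:
    """Exact jump from time t to time s along the PF ODE of the toy problem."""
    return x_t * math.sqrt((SIGMA_DATA**2 + s**2) / (SIGMA_DATA**2 + t**2))


def G_from_g(x_t: float, t: float, s: float, g_value: float) -> float:
    """G = (s / t) * x_t + (1 - s / t) * g, the mixture parameterization."""
    ratio = s / t
    return ratio * x_t + (1.0 - ratio) * g_value


def g_from_G(x_t: float, t: float, s: float) -> float:
    """Invert the mixture to recover g from the exact G (requires s != t)."""
    ratio = s / t
    return (G_exact(x_t, t, s) - ratio * x_t) / (1.0 - ratio)


def denoiser(x_t: float, t: float) -> float:
    """E[x | x_t] for the Gaussian toy problem."""
    return SIGMA_DATA**2 / (SIGMA_DATA**2 + t**2) * x_t
```

Evaluating `g_from_G(x_t, t, s)` for `s` just below `t` reproduces `denoiser(x_t, t)` to numerical precision, while `s = 0` gives the full jump to data. This is the sense in which one function covers both the infinitesimal and the long-range case.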

1.7.2. CTM Training

CTM training combines a distillation loss with powerful auxiliary losses that provide direct training signals from the data, allowing the student (CTM) to surpass the teacher (diffusion model).

  • Soft Consistency Loss: The primary loss is a distillation loss where the CTM's jump prediction $G_\theta(\mathbf{x}_t, t, s)$ is matched against a target generated by a pre-trained teacher model. To make this efficient and effective, CTM uses soft consistency matching. The model is trained to enforce $G_\theta(\mathbf{x}_t, t, s) \approx G_{\text{sg}(\theta)}(\text{Solver}(\mathbf{x}_t, t, u; \varphi), u, s)$, where $u$ is a random time between $s$ and $t$. This serves as a flexible interpolation between local consistency ($u = t - \Delta t$, distilling a single step) and global consistency ($u = s$, distilling the entire interval).

  • Auxiliary Losses:

    1. Denoising Score Matching (DSM) Loss: Since CTM can estimate the denoiser via $g_\theta(\mathbf{x}_t, t, t)$, it is explicitly trained with a DSM loss: $\mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}\left[\|\mathbf{x}_0 - g_\theta(\mathbf{x}_t, t, t)\|_2^2\right]$. This regularizes the model to accurately learn the score function, which is vital for precision and for enabling score-based sampling methods.
    2. Adversarial (GAN) Loss: To further enhance sample quality and refine details, an adversarial loss $\mathcal{L}_{\text{GAN}}$ is added, similar to VQGAN.

The final loss is a weighted sum: $\mathcal{L} := \mathcal{L}_{\text{CTM}} + \lambda_{\text{DSM}}\mathcal{L}_{\text{DSM}} + \lambda_{\text{GAN}}\mathcal{L}_{\text{GAN}}$.

1.7.3. γ-Sampling

CTM's ability to travel between any two points in time enables a novel and flexible sampling scheme called γ-sampling, controlled by a parameter $\gamma \in [0, 1]$. A sampling step from $t_n$ to $t_{n+1}$ involves:

  1. Denoise: Jump from $\mathbf{x}_{t_n}$ to an intermediate time $\sqrt{1 - \gamma^2}\,t_{n+1}$ using the learned function $G_\theta$.
  2. Noisify: Add a controlled amount of noise to reach the noise level corresponding to time $t_{n+1}$.

This process offers a spectrum of sampling behaviors:

  • $\gamma = 0$ (Deterministic): The process is fully deterministic. It follows the PF ODE path directly, avoiding the discretization errors of traditional ODE solvers and the error accumulation of CM's multistep method. Sample quality consistently improves with more NFEs.
  • $\gamma = 1$ (Fully Stochastic): This recovers the multistep sampling method used in Consistency Models. However, this approach suffers from error accumulation, and sample quality can degrade as NFE increases.
  • $0 < \gamma < 1$ (Hybrid): This provides a tunable level of stochasticity, generalizing the stochastic samplers found in models like EDM.
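
The two-phase step described above can be sketched as a short loop. The example reuses a closed-form trajectory map for a 1-D Gaussian toy problem (`G_exact`, an assumption for this illustration) in place of a trained $G_\theta$. The injected noise has standard deviation $\gamma\, t_{n+1}$, so the noise variances add up to $t_{n+1}^2$ under the assumed noising $x_t = x_0 + t z$.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed std of the toy data distribution


def G_exact(x_t: float, t: float, s: float) -> float:
    """Exact jump from time t to time s for data ~ N(0, SIGMA_DATA^2)."""
    return x_t * math.sqrt((SIGMA_DATA**2 + s**2) / (SIGMA_DATA**2 + t**2))


def gamma_sampling(G, x: float, times: list[float], gamma: float, rng) -> float:
    """Run gamma-sampling over decreasing times t_0 > t_1 > ... > t_N."""
    for t_cur, t_next in zip(times[:-1], times[1:]):
        s = math.sqrt(1.0 - gamma**2) * t_next     # intermediate time
        x = G(x, t_cur, s)                         # denoise: jump t_cur -> s
        x = x + gamma * t_next * rng.gauss(0.0, 1.0)  # noisify back to t_next
    return x
```

With `gamma = 0.0` the loop reduces to a chain of deterministic jumps, which for an exact map equals a single jump from the first to the last time; with `gamma = 1.0` every step jumps to time zero and re-noises, matching the multistep procedure of Consistency Models.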

1.8. Flow-Anchored Consistency Model (FACM)

Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: by training a network to learn only a shortcut across a probability flow, the model loses its grasp on the instantaneous velocity field that defines the flow.

1.8.1. The Source of Instability: Losing the Flow Anchor

The practical implementation of the continuous-time CM objective via the training target $T = v + (1 - t)\frac{\mathrm{d}F_{\theta^-}(\mathbf{x}_t, t)}{\mathrm{d}t}$ is notoriously unstable. The core of this instability lies in the target's self-referential nature, creating two fundamental, intertwined problems:

  1. Missing Instantaneous Velocity Field Supervision: The target $T$ explicitly depends on the ground-truth instantaneous velocity $v$. The CM objective, however, only enforces a loss on the final prediction $F_\theta$ (the average velocity). There is no explicit mechanism to ensure that the model's learned dynamics remain faithful to the underlying instantaneous velocity field $v$.

  2. Self-Referential Derivative Estimation: The network is required to estimate its own derivative. Even if the network is pre-trained and initially provides a good approximation of the instantaneous velocity, the CM objective alone provides no continuous supervision to maintain this alignment.

1.8.2. Flow-Anchoring Principle

Stability can be achieved by explicitly anchoring the model in the very flow it is shortcutting. The most direct way to achieve this Flow-Anchoring is to re-introduce the explicit training of the instantaneous velocity field that defines the flow.

1.8.3. FACM Training Strategy

FACM employs a simple yet effective training strategy that mixes two complementary objectives:

$$\mathcal{L}_{\text{FACM}} = \mathcal{L}_{\text{FM}} + \mathcal{L}_{\text{CM}}$$

1.8.3.1. Flow Matching Loss (The Anchor)

This loss component anchors the model by regressing its output towards the instantaneous velocity $v$:

$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}\left[\|F_\theta(\mathbf{x}_t, c_{\text{FM}}) - v\|_2^2 + \mathcal{L}_{\cos}(F_\theta(\mathbf{x}_t, c_{\text{FM}}), v)\right]$$

where $\mathcal{L}_{\cos}(a, b) = 1 - \frac{a \cdot b}{\|a\|_2 \|b\|_2}$.
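
For a single sample, the anchor loss above is a squared error plus a cosine-distance term. The sketch below writes it out for vectors stored as lists of floats; the small constant `eps` guarding against division by zero is an addition for this example and does not appear in the formula.

```python
import math


def fm_loss(pred: list[float], v: list[float], eps: float = 1e-12) -> float:
    """Squared L2 error plus (1 - cosine similarity) between pred and v."""
    sq_err = sum((p - q) ** 2 for p, q in zip(pred, v))
    dot = sum(p * q for p, q in zip(pred, v))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_v = math.sqrt(sum(q * q for q in v))
    cos_term = 1.0 - dot / (norm_p * norm_v + eps)
    return sq_err + cos_term
```

The cosine term penalizes a wrong direction even when the magnitudes happen to be close, whereas the squared error alone mixes both effects.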

1.8.3.2. Consistency Model Loss (The Accelerator)

This component acts as an accelerator, training the model to learn the generative shortcut. We interpret the consistency condition as a fixed-point problem:

$$F_\theta = T(F_\theta), \quad \text{where } T(F) = v + (1 - t)\frac{\mathrm{d}F}{\mathrm{d}t}$$

First, we compute the consistency residual $g$ of the stop-gradient model $F_{\theta^-}$:

$$g = F_{\theta^-}(\mathbf{x}_t, c_{\text{CM}}) - T(F_{\theta^-})$$

Then form a perturbed target:

$$v_{\text{tar}} = F_{\theta^-}(\mathbf{x}_t, c_{\text{CM}}) - \alpha(t) \cdot g = (1 - \alpha(t))\,F_{\theta^-}(\mathbf{x}_t, c_{\text{CM}}) + \alpha(t)\,T(F_{\theta^-})$$

This formulation provides a stable, interpolated learning target between the current model's output and the ideal consistency target. The final CM loss component uses a norm L2 loss, $L_{\text{norm}}$, and is modulated by weighting functions $\alpha(t)$ and $\beta(t)$:

$$\mathcal{L}_{\text{CM}}(\theta) = \mathbb{E}\left[\beta(t) \cdot L_{\text{norm}}(F_\theta(\mathbf{x}_t, c_{\text{CM}}), v_{\text{tar}})\right]$$

It is important to note that our specific choices for weighting and loss functions are designed to accelerate convergence, not as prerequisites for stability, which is guaranteed by the Flow-Anchoring principle.
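
The construction of the perturbed target can be traced on scalars. In this sketch, `F_stop` stands for the stop-gradient output $F_{\theta^-}(\mathbf{x}_t, c_{\text{CM}})$, `dF_dt` for its total time derivative, and `alpha` for the value of $\alpha(t)$ at the current time; all three are passed in as plain numbers because their actual computation is not specified here.

```python
def consistency_target(v: float, t: float, dF_dt: float) -> float:
    """T(F) = v + (1 - t) * dF/dt."""
    return v + (1.0 - t) * dF_dt


def perturbed_target(F_stop: float, v: float, t: float,
                     dF_dt: float, alpha: float) -> float:
    """v_tar = F_stop - alpha * g, with residual g = F_stop - T(F_stop)."""
    g = F_stop - consistency_target(v, t, dF_dt)
    return F_stop - alpha * g
```

Setting `alpha = 0.0` returns the current output unchanged and `alpha = 1.0` returns the full consistency target, so `alpha` controls how far each update moves toward the fixed point. At `t = 1.0` the consistency target reduces to `v`.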

1.8.4. Implementation of the Mixed Objective

1.8.4.1. Expanded Time Interval Strategy

We propose using an expanded time domain to distinguish between the two tasks:

  • CM Task: Operates on the interval $t \in [0, 1]$, using $c_{\text{CM}} = t$
  • FM Task: Maps to the alternate interval $[1, 2]$ by setting $c_{\text{FM}} = 2 - t$

This mapping ensures continuity at the boundary $t = 1$:

$$\lim_{t \to 1^-}\left[v + (1 - t)\frac{\mathrm{d}F_\theta(\mathbf{x}_t, t)}{\mathrm{d}t}\right] = v$$

1.8.4.2. Auxiliary Condition Strategy

Alternatively, we can introduce a second time variable $r$, making the model's full conditioning a tuple of $(t, r)$:

  • $c_{\text{CM}} = (t, 1)$: Model learns CM task (average velocity from $t$ to 1)
  • $c_{\text{FM}} = (t, t)$: Model learns FM task (instantaneous velocity at $t$)

References

  1. Consistency Models
  2. What are Diffusion Models?
  3. Improved Techniques for Training Consistency Models
  4. Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
  5. Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion
  6. Flow-Anchored Consistency Model