Diffusion Transformers (DiTs) and Scalable Interpolant Transformers (SiT)

1. Diffusion Transformers (DiTs)

  1. An image is first compressed into a smaller spatial representation (a "latent") using a pre-trained VAE
  2. The latent representation $z$ of the input is fed to DiT. The noised latent of size $I \times I \times C$ is "patchified" into patches of size $p$, which yields a sequence of $(I/p)^2$ tokens
  3. This sequence of tokens then goes through the Transformer blocks. The authors explore three designs for conditioning generation on contextual information such as the timestep $t$ and the class label $c$. Of the three, adaLN (Adaptive layer norm)-Zero works best, outperforming in-context conditioning and the cross-attention block. The scale and shift parameters, $\gamma$ and $\beta$, are regressed from the sum of the embedding vectors of $t$ and $c$. A dimension-wise scaling parameter $\alpha$ is also regressed and applied immediately before each residual connection within the DiT block
  4. The transformer decoder outputs a noise prediction and a diagonal covariance prediction
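
Steps 2 and 3 can be illustrated with a minimal, pure-Python sketch. It is not the DiT code: the function names are hypothetical, the latent is a nested list rather than a tensor, and $\gamma$, $\beta$ and $\alpha$ are passed in directly instead of being regressed from the conditioning embedding. The `(1 + gamma)` form of the scale is an assumed convention. The 32 x 32 x 4 latent with patch size 2 is only an example size.

```python
import math

def patchify(latent, p):
    """Split an I x I x C latent (nested lists) into (I/p)^2 flattened patch tokens."""
    side = len(latent)
    if side % p != 0:
        raise ValueError("patch size p must divide the latent side length I")
    tokens = []
    for row in range(0, side, p):
        for col in range(0, side, p):
            tokens.append([
                value
                for r in range(row, row + p)
                for c in range(col, col + p)
                for value in latent[r][c]
            ])
    return tokens

def layer_norm(x, eps=1e-6):
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_zero_residual(x, sublayer, gamma, beta, alpha):
    """Return x + alpha * sublayer(modulated LayerNorm(x)), dimension-wise.

    Assumption: the scale is applied as (1 + gamma).
    """
    h = [n * (1.0 + g) + b for n, g, b in zip(layer_norm(x), gamma, beta)]
    out = sublayer(h)
    return [xi + a * oi for xi, a, oi in zip(x, alpha, out)]

# Example sizes: a 32 x 32 x 4 latent with patch size 2.
latent_size, channels, patch = 32, 4, 2
latent = [[[0.1 * ch for ch in range(channels)]
           for _ in range(latent_size)] for _ in range(latent_size)]
tokens = patchify(latent, patch)
print(len(tokens), len(tokens[0]))

# With alpha initialised to zero the residual branch is switched off,
# so the block starts out as the identity map.
dim = len(tokens[0])
out = adaln_zero_residual(tokens[0], lambda h: [2.0 * v for v in h],
                          gamma=[0.0] * dim, beta=[0.0] * dim, alpha=[0.0] * dim)
print(out == tokens[0])
```

The zero-initialised $\alpha$ is what the "Zero" in adaLN-Zero refers to in this sketch: every block contributes nothing at the start of training and the contribution is learned from there.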

2. Scalable Interpolant Transformers (SiT)

SiT design space:

  1. Time discretization: Discrete-time or continuous-time?

    • Adopting a continuous-time training framework provides significant flexibility. It decouples the model's training process from the number of steps used during sampling, which allows one to trade off inference speed and sample quality after the model is already trained.
  2. Model prediction: Score or velocity field?

    • The choice of what the model predicts is critical. Training the model to predict the velocity field $v(x, t)$ using a velocity loss ($L_v$), or using an equivalent weighted score loss ($L_{s_\lambda}$), leads to substantially better performance than predicting the standard score. This is because the velocity parameterization effectively compensates for the vanishing gradients that the standard score objective suffers from when the noise level is low (as $t \to 0$).
  3. Interpolant: SBDM-VP, linear or GVP (Generalized VP)?

    • Linear ($\alpha_t = 1 - t$, $\sigma_t = t$) and GVP ($\alpha_t = \cos(\tfrac{\pi}{2} t)$, $\sigma_t = \sin(\tfrac{\pi}{2} t)$) outperform the standard SBDM-VP path used in many diffusion models. These superior paths are more direct and have a lower "transport cost," which simplifies the learning problem by reducing the curvature of the generation trajectories.
  4. Sampler: ODE or SDE? Choose which diffusion coefficient?

    • First, using a stochastic sampler (SDE) generally produces higher-quality final samples (lower FID scores) compared to a deterministic one (ODE), as it offers better theoretical control over the KL divergence.
    • Second, the diffusion coefficient ($w_t$) in the SDE sampler is a highly effective and tunable parameter. A major finding is that $w_t$ can be chosen and optimized after the model has been trained, without any retraining cost. By selecting a theoretically motivated $w_t$ that minimizes an upper bound on the KL divergence, the model's performance can be further improved.
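
The first three design choices can be made concrete with a small one-dimensional sketch. It is not the SiT implementation. It assumes the interpolant has the form $x_t = \alpha_t x_* + \sigma_t \varepsilon$, with $t = 0$ as data and $t = 1$ as noise, so that the velocity target is its time derivative. `toy_velocity` is a hypothetical stand-in for a trained network and is exact only for a "dataset" consisting of the single point `mu`. Only the deterministic ODE sampler is shown; the SDE sampler and the choice of $w_t$ are left out.

```python
import math
import random

# Each schedule returns (alpha_t, sigma_t, d alpha_t / dt, d sigma_t / dt).
def linear(t):
    return 1.0 - t, t, -1.0, 1.0

def gvp(t):
    a = math.cos(0.5 * math.pi * t)
    s = math.sin(0.5 * math.pi * t)
    return a, s, -0.5 * math.pi * s, 0.5 * math.pi * a

def interpolate(schedule, x_data, eps, t):
    """x_t = alpha_t * x_data + sigma_t * eps."""
    alpha, sigma, _, _ = schedule(t)
    return alpha * x_data + sigma * eps

def velocity_target(schedule, x_data, eps, t):
    """Regression target for a velocity model: the time derivative of x_t."""
    _, _, d_alpha, d_sigma = schedule(t)
    return d_alpha * x_data + d_sigma * eps

def toy_velocity(schedule, x, t, mu):
    """Exact velocity field when all data sits at the point mu (toy assumption)."""
    alpha, sigma, d_alpha, d_sigma = schedule(t)
    eps = (x - alpha * mu) / sigma  # noise implied by observing x at time t
    return d_alpha * mu + d_sigma * eps

def euler_ode_sample(schedule, x_noise, mu, num_steps):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x, dt = x_noise, 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt  # t = 0 is never evaluated, so sigma_t stays positive
        x = x - dt * toy_velocity(schedule, x, t, mu)
    return x

random.seed(0)
mu = 2.0
x_noise = random.gauss(0.0, 1.0)

# The same velocity function is sampled with different step counts,
# which is the flexibility a continuous-time formulation provides.
for steps in (4, 16, 64):
    print(steps,
          euler_ode_sample(linear, x_noise, mu, steps),
          euler_ode_sample(gvp, x_noise, mu, steps))
```

In this toy setting the linear path is a straight line, so Euler integration recovers `mu` with any number of steps, while the GVP path is curved and its error shrinks as the step count grows. That behaviour is specific to the one-point example and should not be read as a statement about real image models.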

3. MaskDiT

MaskDiT introduces a masked autoencoder (MAE) approach specifically designed for diffusion transformers. The key innovation lies in decomposing the traditional diffusion training objective into two complementary subtasks:

  1. Score estimation on unmasked patches: The model learns to predict noise/velocity on visible image patches
  2. MAE reconstruction on masked patches: The model reconstructs missing patches based on visible context

The training objective combines both tasks:

$$\mathcal{L} = \mathcal{L}_{\text{DSM}} + \lambda \mathcal{L}_{\text{MAE}}$$

where the denoising score matching loss is:

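$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{x_0 \sim p_{\text{data}}} \mathbb{E}_{n \sim \mathcal{N}(0, t^2 I)} \mathbb{E}_{m} \left\| \left( \mathcal{D}_\theta\big((x_0 + n) \odot (1 - m), t\big) - x_0 \right) \odot (1 - m) \right\|^2$$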

and the MAE reconstruction loss is:

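$$\mathcal{L}_{\text{MAE}} = \mathbb{E}_{x_0 \sim p_{\text{data}}} \mathbb{E}_{n \sim \mathcal{N}(0, t^2 I)} \mathbb{E}_{m} \left\| \left( \mathcal{D}_\theta\big((x_0 + n) \odot (1 - m), t\big) - (x_0 + n) \right) \odot m \right\|^2$$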

Here $m$ represents the binary masking pattern, $\odot$ denotes element-wise multiplication, and $\mathcal{D}_\theta$ is the diffusion transformer model.
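
The two terms can be written out for a single sample as a rough sketch. This is an illustration under simplifying assumptions, not the MaskDiT training code: the input is a flat list of numbers, the mask acts per element rather than per patch, the expectations are replaced by one draw of $x_0$, $n$ and $m$, and `denoiser` is a hypothetical callable standing in for $\mathcal{D}_\theta$.

```python
def masked_losses(denoiser, x0, noise, mask, t):
    """Return (dsm, mae) for one sample; mask[i] == 1 marks a masked entry."""
    noisy = [x + n for x, n in zip(x0, noise)]
    # The model only sees the unmasked part of the noisy input.
    visible_input = [v * (1 - m) for v, m in zip(noisy, mask)]
    pred = denoiser(visible_input, t)
    # Score-matching term: compare with the clean data on unmasked entries.
    dsm = sum(((p - x) * (1 - m)) ** 2 for p, x, m in zip(pred, x0, mask))
    # Reconstruction term: compare with the noisy input on masked entries.
    mae = sum(((p - v) * m) ** 2 for p, v, m in zip(pred, noisy, mask))
    return dsm, mae

def total_loss(dsm, mae, lam):
    """Combine both terms with the weighting coefficient lambda."""
    return dsm + lam * mae

x0 = [1.0, 2.0, 3.0, 4.0]
noise = [0.5, -0.5, 1.0, -1.0]
mask = [0, 1, 0, 1]

# A hypothetical "oracle" denoiser that always returns the clean data.
oracle = lambda visible_input, t: x0
dsm, mae = masked_losses(oracle, x0, noise, mask, t=1.0)
print(dsm, mae, total_loss(dsm, mae, lam=0.1))
```

With the oracle, the score-matching term vanishes while the reconstruction term does not, because its target on masked entries is the noisy input $x_0 + n$ rather than $x_0$.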

4. SD-DiT

SD-DiT's architecture is built upon a decoupled encoder-decoder structure within a teacher-student scheme. This design separates the discriminative and generative learning processes. The model comprises a student encoder, a student decoder, and a teacher encoder whose weights are an exponential moving average (EMA) of the student encoder's weights.
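
The EMA relationship between teacher and student can be sketched as follows. The dictionary-of-floats "weights", the function name and the decay value are assumptions made for illustration; a real model would apply the same rule to every parameter tensor.

```python
def ema_update(teacher, student, decay):
    """In place: teacher <- decay * teacher + (1 - decay) * student."""
    for name, value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * value
    return teacher

# Hypothetical scalar "weights" for a student and a teacher encoder.
student_weights = {"w": 1.0, "b": -2.0}
teacher_weights = {"w": 0.0, "b": 0.0}

# The teacher drifts slowly toward the student and receives no gradient updates.
for step in range(3):
    ema_update(teacher_weights, student_weights, decay=0.9)
    print(step, teacher_weights)
```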

The model is trained using two distinct objectives: a Generative Loss and a Discriminative Loss.

4.1. Generative Objective

The generative task is handled by the student branch (encoder and decoder). The student encoder processes only the visible patches of a noised and masked input image. To avoid the training-inference discrepancy, the decoder is fed the processed visible tokens along with the original, unmodified invisible patches, rather than learnable mask tokens. The objective is to denoise the full image using the standard EDM loss formulation.
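
How the decoder input might be put together can be sketched as a simple merge in patch order. This is a hypothetical illustration: tokens are treated as opaque values, the function and argument names are invented, and any projection needed to match the dimensions of encoder outputs and untouched patches is ignored.

```python
def assemble_decoder_input(encoded_visible, untouched_patches, mask):
    """Merge tokens in original patch order.

    mask[i] == 1 means patch i was hidden from the encoder.
    encoded_visible holds one encoder output per visible patch, in order.
    untouched_patches holds every patch as it was before encoding.
    """
    num_visible = sum(1 for hidden in mask if not hidden)
    if len(encoded_visible) != num_visible:
        raise ValueError("need exactly one encoded token per visible patch")
    visible = iter(encoded_visible)
    return [untouched_patches[i] if hidden else next(visible)
            for i, hidden in enumerate(mask)]

mask = [0, 1, 1, 0]
patches = ["p0", "p1", "p2", "p3"]
encoded = ["e0", "e3"]
print(assemble_decoder_input(encoded, patches, mask))
```

No learnable mask token appears anywhere in the merged sequence, which is the point of the design: the decoder sees the same kind of input during training as at inference.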

The generative loss $\mathcal{L}_G$ is defined as:

ℒ𝐺=𝐸π‘₯0βˆΌπ‘Β dataπΈπ‘›βˆΌπ’©(0,πœŽπ‘†2𝐼)β€–π·πœƒ(π‘₯0+𝑛,πœŽπ‘†,β„³)βˆ’π‘₯0β€–22

Here, $D_\theta$ represents the student branch, $\sigma_S$ is the variable noise level for the student view, and $\mathcal{M}$ is the binary mask.

4.2. Discriminative Objective

The discriminative task aims to enforce inter-image alignment between the student and teacher encoder outputs in a shared embedding space. This is achieved by minimizing the cross-entropy loss between the softmax probability distributions of the student's visible tokens and the teacher's corresponding tokens.

The teacher view, $x_{\sigma_T}$, is created using a fixed, minimal noise level ($\sigma_{\min}$) to serve as a high-quality, stable reference, a concept inspired by Consistency Models. For each visible token $i$, the loss is:

ℒ𝐷(𝑖)=βˆ’βˆ‘π‘˜π‘ƒπ‘‡π‘–log(𝑃𝑆𝑖)

The total discriminative loss is averaged over all visible tokens and the [CLS] token:

ℒ𝐷=11βˆ’β„³βˆ‘π‘–βˆˆ(1βˆ’β„³)ℒ𝐷(𝑖)+ℒ𝐷([CLS])

The final training objective for the student network is the sum of both losses:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_G + \mathcal{L}_D$$

References

  1. What are Diffusion Models?
  2. Scalable Diffusion Models with Transformers
  3. SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
  4. Fast Training of Diffusion Models with Masked Transformers
  5. SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer