Diffusion Transformers (DiTs) and Scalable Interpolant Transformers (SiT)

June 25, 2025

by Leonardo

1. Diffusion Transformers (DiTs)

An image is first compressed into a smaller spatial representation (a "latent") using a pre-trained VAE.
DiT takes the noised latent $𝑧$ as input. The latent, of size $𝐼 \times 𝐼 \times 𝐶$ , is "patchified" into patches of size $𝑝 \times 𝑝$ , which yields a sequence of ${(\frac{𝐼}{𝑝})}^{2}$ tokens.
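The patchify step can be sketched in a few lines of plain Python. This is only an illustration on nested lists: the function name `patchify` and the toy sizes are assumptions, and the linear embedding that DiT applies to each flattened patch is left out.

```python
def patchify(latent, p):
    """Split an I x I x C latent (nested lists) into (I/p)^2 flattened patches."""
    size = len(latent)
    if size % p != 0:
        raise ValueError("patch size p must divide the latent side length I")
    tokens = []
    for row in range(0, size, p):          # top-left corner of each patch
        for col in range(0, size, p):
            patch = [
                value
                for r in range(row, row + p)
                for c in range(col, col + p)
                for value in latent[r][c]  # the C channel values at (r, c)
            ]
            tokens.append(patch)
    return tokens


# Toy latent with I = 4 and C = 2; each entry encodes its own position.
I, C, p = 4, 2, 2
latent = [[[10 * r + c, -(10 * r + c)] for c in range(I)] for r in range(I)]
tokens = patchify(latent, p)
print(len(tokens), len(tokens[0]))  # (I/p)^2 tokens, each holding p*p*C values
```

Halving $𝑝$ quadruples the sequence length, which is why the patch size has a large effect on compute.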
This sequence of tokens then goes through the Transformer blocks. The authors explore three designs for conditioning generation on contextual information. Of the three, adaLN (adaptive layer norm)-Zero works best, outperforming both in-context conditioning and the cross-attention block. The scale and shift parameters, $𝛾$ and $𝛽$ , are regressed from the sum of the embedding vectors of the timestep $𝑡$ and the class label $𝑐$ . The dimension-wise scaling parameters $𝛼$ are also regressed and applied immediately before each residual connection within the DiT block.
The transformer decoder outputs a noise prediction and a diagonal covariance prediction.
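A minimal sketch of the adaLN-Zero residual update on a single token vector, under a few assumptions: the names `layer_norm` and `adaln_zero_residual` are illustrative, the scale is applied as `1 + gamma`, and `sublayer` stands in for the attention or MLP sub-block. In the real model $𝛾$ , $𝛽$ and $𝛼$ come from an MLP on the conditioning embedding; here they are passed in directly.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm over one token vector, without learnable affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_zero_residual(x, sublayer, gamma, beta, alpha):
    """Return x + alpha * sublayer(LayerNorm(x) * (1 + gamma) + beta), per dimension."""
    h = layer_norm(x)
    h = [hv * (1.0 + g) + b for hv, g, b in zip(h, gamma, beta)]
    out = sublayer(h)
    # alpha gates the sub-block output right before the residual connection.
    return [xv + a * ov for xv, a, ov in zip(x, alpha, out)]


x = [1.0, 2.0, 4.0]
double = lambda h: [2.0 * v for v in h]  # hypothetical stand-in for attention/MLP
zeros = [0.0, 0.0, 0.0]
ones = [1.0, 1.0, 1.0]

at_init = adaln_zero_residual(x, double, gamma=zeros, beta=zeros, alpha=zeros)
gated_on = adaln_zero_residual(x, double, gamma=zeros, beta=zeros, alpha=ones)
print(at_init == x)  # with alpha = 0 the block is an identity map
```

The "Zero" part refers to initializing the regression so that $𝛼$ starts at zero, which makes every block start out as the identity function.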

2. Scalable Interpolant Transformers (SiT)

SiT design space:

Time discretization: Discrete-time or continuous-time?
- Adopting a continuous-time training framework provides significant flexibility. It decouples the model's training process from the number of steps used during sampling, which allows one to trade off inference speed and sample quality after the model is already trained.
Model prediction: Score or velocity field?
- The choice of what the model predicts is critical. Training the model to predict the velocity field ( $𝑣 (𝑥, 𝑡)$ ) using a velocity loss ( $𝐿_{𝑣}$ ), or using an equivalent weighted score loss ( $𝐿_{𝑠_{𝜆}}$ ), leads to substantially better performance than predicting the standard score. This is because the velocity parameterization effectively compensates for the vanishing gradients that the standard score objective suffers from when the noise level is low (as $𝑡 \to 0$ ).
Interpolant: SBDM-VP, linear or GVP (Generalized VP)?
- Linear ( $𝛼_{𝑡} = 1 - 𝑡, 𝜎_{𝑡} = 𝑡$ ) and GVP ( $𝛼_{𝑡} = \cos (\frac{𝜋}{2} 𝑡), 𝜎_{𝑡} = \sin (\frac{𝜋}{2} 𝑡)$ ) outperform the standard SBDM-VP path used in many diffusion models. These superior paths are more direct and have a lower "transport cost," which simplifies the learning problem by reducing the curvature of the generation trajectories.
Sampler: ODE or SDE? Which diffusion coefficient?
- First, using a stochastic sampler (SDE) generally produces higher-quality final samples (lower FID scores) compared to a deterministic one (ODE), as it offers better theoretical control over the KL divergence.
- Second, the diffusion coefficient ( $𝑤_{𝑡}$ ) in the SDE sampler is a highly effective and tunable parameter. A major finding is that $𝑤_{𝑡}$ can be chosen and optimized after the model has been trained, without any retraining cost. By selecting a theoretically motivated $𝑤_{𝑡}$ that minimizes an upper bound on the KL divergence, the model's performance can be further improved.
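The design choices above can be made concrete with a small sketch. It assumes the convention used by the interpolants listed here ( $𝑡 = 0$ is data, $𝑡 = 1$ is noise) and works per sample, so the "velocity" is the regression target for one $(𝑥, 𝜀)$ pair rather than the learned field. All function names are illustrative, and `sampler_step` is a plain Euler(-Maruyama) discretization chosen for brevity, not necessarily the integrator used in the paper.

```python
import math
import random


def linear_path(t):
    """(alpha, sigma, d_alpha/dt, d_sigma/dt) for alpha = 1 - t, sigma = t."""
    return 1.0 - t, t, -1.0, 1.0


def gvp_path(t):
    """Same tuple for alpha = cos(pi t / 2), sigma = sin(pi t / 2)."""
    a = 0.5 * math.pi * t
    return (math.cos(a), math.sin(a),
            -0.5 * math.pi * math.sin(a), 0.5 * math.pi * math.cos(a))


def interpolate(x_data, eps, t, path):
    """x_t = alpha_t * x_data + sigma_t * eps."""
    alpha, sigma, _, _ = path(t)
    return [alpha * x + sigma * e for x, e in zip(x_data, eps)]


def velocity_target(x_data, eps, t, path):
    """Target of the velocity loss: the time derivative of the interpolant."""
    _, _, d_alpha, d_sigma = path(t)
    return [d_alpha * x + d_sigma * e for x, e in zip(x_data, eps)]


def score_from_velocity(x_t, v, t, path):
    """Convert a velocity into a score; singular at t = 0, where sigma = 0."""
    alpha, sigma, d_alpha, d_sigma = path(t)
    denom = sigma * (d_alpha * sigma - alpha * d_sigma)
    return [(alpha * vi - d_alpha * xi) / denom for xi, vi in zip(x_t, v)]


def sampler_step(x_t, v, s, dt, w=0.0, noise=None):
    """One step from time t to t - dt, i.e. towards the data.

    w = 0 is the deterministic ODE update. w > 0 adds a score correction and
    Gaussian noise; w is a sampling-time choice and does not touch training.
    """
    if noise is None:
        noise = [random.gauss(0.0, 1.0) for _ in x_t]
    return [x + (-vi + 0.5 * w * si) * dt + math.sqrt(w * dt) * z
            for x, vi, si, z in zip(x_t, v, s, noise)]


x_data, eps, t = [1.0, -2.0], [0.5, 0.25], 0.5
x_t = interpolate(x_data, eps, t, linear_path)
v = velocity_target(x_data, eps, t, linear_path)
s = score_from_velocity(x_t, v, t, linear_path)   # equals -eps / sigma_t here
x_ode = sampler_step(x_t, v, s, dt=0.1)           # deterministic, w = 0
x_sde = sampler_step(x_t, v, s, dt=0.1, w=0.5)    # stochastic, w picked at sampling time
```

Because `dt` and `w` are arguments of the sampler only, both the step count and the diffusion coefficient can be changed after training, which is the flexibility the continuous-time and SDE bullets describe.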

3. MaskDiT

MaskDiT introduces a masked autoencoder (MAE) approach specifically designed for diffusion transformers. The key innovation lies in decomposing the traditional diffusion training objective into two complementary subtasks:

- Score estimation on unmasked patches: The model learns to predict noise/velocity on visible image patches.
- MAE reconstruction on masked patches: The model reconstructs missing patches based on visible context.

The training objective combines both tasks:

$$ℒ = ℒ_{DSM} + 𝜆 ℒ_{MAE}$$

where the denoising score matching loss is:

$$ℒ_{DSM} = 𝐸_{𝑥_{0} \sim 𝑝_{data}} 𝐸_{𝑛 \sim 𝒩 (0, 𝑡^{2} 𝐼)} 𝐸_{𝑚} {‖ (𝒟_{𝜃} ((𝑥_{0} + 𝑛) ⊙ (1 - 𝑚), 𝑡) - 𝑥_{0}) ⊙ (1 - 𝑚) ‖}^{2}$$

and the MAE reconstruction loss is:

$$ℒ_{MAE} = 𝐸_{𝑥_{0} \sim 𝑝_{data}} 𝐸_{𝑛 \sim 𝒩 (0, 𝑡^{2} 𝐼)} 𝐸_{𝑚} {‖ (𝒟_{𝜃} ((𝑥_{0} + 𝑛) ⊙ (1 - 𝑚), 𝑡) - (𝑥_{0} + 𝑛)) ⊙ 𝑚 ‖}^{2}$$

Here $𝑚$ represents the binary masking pattern, $⊙$ denotes element-wise multiplication, and $𝒟_{𝜃}$ is the diffusion transformer model.
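As a rough illustration, the two terms can be evaluated for a single sample, noise draw and mask. The sketch makes several simplifying assumptions: the mask acts per element rather than per patch, masked entries are zeroed instead of being dropped from the sequence, and `denoiser` is a hypothetical callable standing in for $𝒟_{𝜃}$ . The function name `maskdit_losses` is also illustrative.

```python
def maskdit_losses(x0, n, m, denoiser, t, lam):
    """Single-sample estimate of (L_DSM, L_MAE, L_DSM + lam * L_MAE).

    m[i] = 1 marks a masked entry, m[i] = 0 a visible one.
    """
    noisy = [x + e for x, e in zip(x0, n)]
    visible_input = [v * (1 - mi) for v, mi in zip(noisy, m)]  # (x0 + n) * (1 - m)
    pred = denoiser(visible_input, t)
    # Denoising term: compare with the clean x0 on visible entries only.
    l_dsm = sum(((p - x) * (1 - mi)) ** 2 for p, x, mi in zip(pred, x0, m))
    # Reconstruction term: compare with the noisy input on masked entries only.
    l_mae = sum(((p - v) * mi) ** 2 for p, v, mi in zip(pred, noisy, m))
    return l_dsm, l_mae, l_dsm + lam * l_mae


# Hypothetical stand-in model that always predicts zeros.
zero_denoiser = lambda z, t: [0.0] * len(z)

x0 = [1.0, 2.0, 3.0, 4.0]
n = [0.5, -0.5, 1.0, -1.0]
m = [0, 1, 0, 1]
l_dsm, l_mae, total = maskdit_losses(x0, n, m, zero_denoiser, t=1.0, lam=0.1)
```

The two terms touch disjoint sets of entries, so $𝜆$ only controls how much the reconstruction of hidden content matters relative to denoising of visible content.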

4. SD-DiT

SD-DiT's architecture is built upon a decoupled encoder-decoder structure within a teacher-student scheme. This design separates the discriminative and generative learning processes. The model comprises a student encoder, a student decoder, and a teacher encoder whose weights are an exponential moving average (EMA) of the student encoder's weights.
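The EMA relationship between the teacher and student encoders can be sketched as follows. Weights are represented as a plain dictionary of floats, and both the function name `ema_update` and the decay value are assumptions made for illustration.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}
teacher = ema_update(teacher, student, decay=0.9)
```

The teacher receives no gradients; it only tracks a smoothed copy of the student encoder, which keeps its outputs stable as a reference.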

The model is trained using two distinct objectives: a Generative Loss and a Discriminative Loss.

4.1. Generative Objective

The generative task is handled by the student branch (encoder and decoder). The student encoder processes only the visible patches of a noised and masked input image. To avoid the training-inference discrepancy, the decoder is fed the processed visible tokens along with the original, unmodified invisible patches, rather than learnable mask tokens. The objective is to denoise the full image using the standard EDM loss formulation.

The generative loss $ℒ_{𝐺}$ is defined as:

$$ℒ_{𝐺} = 𝐸_{𝑥_{0} \sim 𝑝_{data}} 𝐸_{𝑛 \sim 𝒩 (0, 𝜎_{𝑆}^{2} 𝐼)} {‖ 𝐷_{𝜃} (𝑥_{0} + 𝑛, 𝜎_{𝑆}, ℳ) - 𝑥_{0} ‖}_{2}^{2}$$

Here, $𝐷_{𝜃}$ represents the student branch, $𝜎_{𝑆}$ is the variable noise level for the student view, and $ℳ$ is the binary mask.

4.2. Discriminative Objective

The discriminative task aims to enforce inter-image alignment between the student and teacher encoder outputs in a shared embedding space. This is achieved by minimizing the cross-entropy loss between the softmax probability distributions of the student's visible tokens and the teacher's corresponding tokens.

The teacher view, $𝑥_{𝜎_{𝑇}}$ , is created using a fixed, minimal noise level ( $𝜎_{min}$ ) to serve as a high-quality, stable reference, a concept inspired by Consistency Models. For each visible token $𝑖$ , the loss is:

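$$ℒ_{𝐷} (𝑖) = - \sum_{𝑘} 𝑃_{𝑇_{𝑖}}^{(𝑘)} \log (𝑃_{𝑆_{𝑖}}^{(𝑘)})$$

where $𝑘$ indexes the entries of the softmax distributions $𝑃_{𝑇_{𝑖}}$ (teacher) and $𝑃_{𝑆_{𝑖}}$ (student).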

The total discriminative loss averages the per-token loss over all visible tokens and adds the [CLS] token term:

$$ℒ_{𝐷} = \frac{1}{| 1 - ℳ |} \sum_{𝑖 \in (1 - ℳ)} ℒ_{𝐷} (𝑖) + ℒ_{𝐷} ([CLS])$$

where $| 1 - ℳ |$ is the number of visible (unmasked) tokens.

The final training objective for the student network is the sum of both losses:

$$ℒ_{total} = ℒ_{𝐺} + ℒ_{𝐷}$$
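A minimal sketch of the discriminative term, assuming the encoders emit one logit vector per token and that the mask uses 1 for masked and 0 for visible positions. The function names are illustrative, and details such as temperatures or centering of the teacher outputs are not modelled here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_cross_entropy(teacher_logits, student_logits):
    """-sum_k P_T[k] * log(P_S[k]) for one token."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))


def discriminative_loss(teacher_tokens, student_tokens, mask, teacher_cls, student_cls):
    """Mean cross-entropy over visible tokens plus the [CLS] term."""
    visible = [i for i, masked in enumerate(mask) if masked == 0]
    patch_term = sum(
        token_cross_entropy(teacher_tokens[i], student_tokens[i]) for i in visible
    ) / len(visible)
    return patch_term + token_cross_entropy(teacher_cls, student_cls)


# Toy example: three patch positions, the middle one masked for the student.
teacher_tokens = [[0.0, 0.0], [1.0, -1.0], [0.0, 0.0]]
student_tokens = [[0.0, 0.0], None, [0.0, 0.0]]  # student never sees masked patches
mask = [0, 1, 0]
loss_d = discriminative_loss(teacher_tokens, student_tokens, mask, [0.0, 0.0], [0.0, 0.0])
```

In this toy case every compared distribution is uniform over two entries, so each cross-entropy equals $\log 2$ and the total is $2 \log 2$ . In training, gradients from this term would flow only through the student, since the teacher is updated by EMA.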
