Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

June 25, 2025

by Leonardo

1. REPresentation Alignment (REPA)

The main challenge in training diffusion models stems from the need to learn a high-quality internal representation $ℎ$. We demonstrate that training generative diffusion models becomes significantly easier and more effective when supported by an external representation, $𝑦_{*}$. Specifically, we propose a simple regularization technique that uses recent self-supervised visual representations as $𝑦_{*}$, leading to substantial improvements in both the training efficiency and the generation quality of diffusion transformers.

These insights inspire us to enhance generative models by incorporating external self-supervised representations. However, two problems arise:

- Input mismatch: diffusion models work with noisy inputs, while most self-supervised encoders are trained on clean images.
- Task mismatch: off-the-shelf vision encoders are not designed for tasks like reconstruction or generation.

To overcome these technical hurdles, we guide the feature learning of diffusion models using a regularization technique called REPresentation Alignment (REPA) that distills pretrained self-supervised representations into diffusion representations, offering a flexible way to integrate high-quality representations.

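Let $𝑓$ be a pretrained encoder and $𝑥_{*}$ a clean image, with $𝑦_{*} = 𝑓 (𝑥_{*})$ the encoder's patch-wise output. REPA aligns $ℎ_{𝜑} (ℎ_{𝑡})$ with $𝑦_{*}$, where $ℎ_{𝑡} = 𝑓_{𝜃} (𝑧_{𝑡})$ is the diffusion transformer encoder output for the noisy input and $ℎ_{𝜑}$ is a trainable projection head. The objective maximizes the patch-wise similarity between the two, averaged over the $𝑁$ patches: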

ℒ_{REPA} (𝜃, 𝜑) = - 𝐸_{𝑥_{*}, 𝜀, 𝑡} [\frac{1}{𝑁} \sum_{𝑛 = 1}^{𝑁} sim (𝑦_{*}^{[𝑛]}, ℎ_{𝜑} (ℎ_{𝑡}^{[𝑛]}))]
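To make the objective concrete, here is a minimal NumPy sketch of the REPA loss for a single sample. It assumes that `sim` is cosine similarity and that both feature sets already have the same shape (N patches, D channels); the function names are illustrative and not taken from any released code.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (N, D) arrays."""
    a_norm = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_norm = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.sum(a_norm * b_norm, axis=-1)

def repa_loss(target_feats, projected_feats):
    """Negative mean patch-wise similarity for one sample.

    target_feats:    (N, D) patch features y_* from the frozen encoder f
    projected_feats: (N, D) projected diffusion features h_phi(h_t)
    """
    return -float(np.mean(cosine_similarity(target_feats, projected_feats)))

rng = np.random.default_rng(0)
y_star = rng.normal(size=(16, 8))
aligned = repa_loss(y_star, 2.0 * y_star)  # same direction, different scale
opposed = repa_loss(y_star, -y_star)       # opposite direction
```

Because the similarity is scale-invariant under this assumption, perfectly aligned features reach the minimum loss of -1 regardless of their norm, while opposed features give +1.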

2. Self-Representation Alignment (SRA)

Figure 2: Left: Methods like MaskDiT and SD-DiT use an external representation task to guide diffusion transformer. Middle: Methods like REPA leverage an external representation foundation model as guidance. Right (our approach): We do not use any external representation component but still obtain such guidance through proposed self-representation alignment technique.

SRA attains self-alignment by minimizing the patch-wise distance between the teacher's output ( $𝑦_{*}$ ) and the student's output variant ( $𝑗_{𝜓} (𝑦)$ ):

ℒ_{sa} (𝜁, 𝜓) = 𝔼_{𝑥_{𝑡}, 𝑡, 𝑐} [\frac{1}{𝑁} \sum_{𝑛 = 1}^{𝑁} dist (𝑦_{*}^{[𝑛]}, 𝑗_{𝜓} (𝑦^{[𝑛]}))]

where $[𝑛]$ is a patch index, $dist (\cdot, \cdot)$ is a pre-defined distance function, and $𝜁$ and $𝜓$ are the parameters of the student diffusion transformer and the projection head, respectively. This objective is similar to the REPA objective, except that the alignment target comes from the DiT's own teacher rather than from an external encoder. The final loss is:

ℒ = ℒ_{gen} + 𝜆 ℒ_{sa}

Figure 3: SRA aligns the student's latent representation in the earlier layer conditioned on higher noise (green branch) to that of the teacher in the later layer conditioned on lower noise (blue branch) to achieve self-representation alignment. We use a stop-gradient (sg) operator on the teacher to let gradients flow only through the student, and update the teacher's parameters with an exponential moving average (ema) of the student's parameters.
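The teacher update described in the caption can be sketched as follows. This is an illustrative NumPy version that treats parameters as a dictionary of arrays; the decay value and the function name are assumptions, not values from the paper.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Return teacher parameters moved toward the student by EMA.

    teacher, student: dicts mapping parameter names to arrays.
    No gradient is involved here; the teacher is only ever updated this way,
    which mirrors the stop-gradient on the teacher branch.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
for _ in range(10):
    teacher = ema_update(teacher, student, decay=0.9)
```

With a fixed student, the teacher closes the gap geometrically: after k updates the remaining distance is decay^k times the initial one.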

3. LayerSync

Figure 4: LayerSync improves training efficiency and generation quality via internal representation alignment.

4. Velocity Refiner with Acceleration (VeRA)

Existing flow-based generative models, such as SiT, primarily utilize only the final layer's output for velocity prediction, under-utilizing rich intermediate representations. This under-utilization of internal features leads to slower model convergence and suboptimal generation performance in flow-based models.

4.1. Deep Supervision

DeepFlow employs deep supervision by inserting auxiliary velocity layers after selected intermediate transformer blocks. The corresponding deep supervision loss at these key transformer layers is defined as follows:

ℒ_{deep} = 𝐸 [\sum_{𝑖 = 1}^{𝑘} 𝛽^{𝑖} ({‖ 𝑣_{𝜃}^{𝑖} (𝑥_{𝑡}^{𝑖}, 𝑡, 𝑐) - 𝑉 ‖}^{2})]

This approach encourages intermediate layers to develop meaningful velocity representations earlier in the network, improving gradient flow and feature alignment across layers.

4.2. Velocity Refiner with Acceleration (VeRA) Block

The VeRA block is positioned between adjacent branches: it refines the velocity features from the preceding layers and aligns them with the subsequent processing stages. The block comprises three main components:

Acceleration Generation and Second-Order ODE Training

The VeRA block begins by generating an "acceleration feature" from the input velocity feature using a Multi-Layer Perceptron (ACC_MLP):

𝑎_{𝑡_{1}}^{*} = ACC_MLP (𝑣_{𝑡_{1}}^{*})

We then endow $𝑎_{𝑡}^{*}$ with an acceleration property by training it through a second-order ordinary differential equation ( $2^{nd}$ -ODE):

\begin{matrix} ℒ_{acc} = 𝐸 [‖ 2^{nd} -ODE (𝑥_{𝑡_{1}}, 𝑣_{𝑡_{1}}, 𝑎_{𝑡_{1}}, 𝑑_{𝑡_{1} \to 0}) - 𝑥_{0} ‖], \\ 2^{nd} -ODE = 𝑥_{𝑡_{1}} + 𝑣_{𝑡_{1}} ⊙ 𝑑_{𝑡_{1} \to 0} + \frac{1}{2} 𝑎_{𝑡_{1}} ⊙ {(𝑑_{𝑡_{1} \to 0})}^{2} \end{matrix}
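The second-order update is a Taylor step, which a small NumPy sketch can illustrate. It assumes the time gap is a signed scalar and that velocity and acceleration are already in the same space as the sample; in DeepFlow these are learned features, so this shows the formula only, not the authors' implementation.

```python
import numpy as np

def second_order_step(x_t, v_t, a_t, dt):
    """Second-order Taylor step: x + v*dt + 0.5*a*dt^2 (element-wise)."""
    return x_t + v_t * dt + 0.5 * a_t * dt ** 2

def acceleration_loss(x_t, v_t, a_t, dt, x_0):
    """Norm of the gap between the extrapolated point and the target x_0."""
    return float(np.linalg.norm(second_order_step(x_t, v_t, a_t, dt) - x_0))

# Toy trajectory x(t) = t^2 sampled at t1 = 1, so v = 2 * t1 and a = 2.
t1 = 1.0
x_t1 = np.array([t1 ** 2])
v_t1 = np.array([2.0 * t1])
a_t1 = np.array([2.0])
d = 0.0 - t1  # signed time gap from t1 to 0
x_pred = second_order_step(x_t1, v_t1, a_t1, d)
loss = acceleration_loss(x_t1, v_t1, a_t1, d, np.array([0.0]))
```

For a trajectory that is exactly quadratic in time, the step lands on the true endpoint and the loss vanishes, which is the behavior the acceleration term is meant to capture.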

Feature Concatenation and Time-gap Conditioning

After computing the acceleration features ( $𝑎_{𝑡}^{*}$ ), we concatenate them with the original velocity features ( $𝑣_{𝑡}^{*}$ ). To make the concatenated feature aware of the time gap, we apply a time-gap-conditioned adaptive layer normalization followed by an MLP:

modulate (𝑣_{𝑡_{1}}^{*}) = MLP (AdaLN-Zero (concat (𝑣_{𝑡_{1}}^{*}, 𝑎_{𝑡_{1}}^{*}), 𝑇 (𝑑_{𝑡_{1} \to 𝑡_{2}})))

Spatial Information Integration via Cross-Attention

Beyond temporal feature alignment across time-steps, the VeRA block also integrates spatial context through a cross-attention (CA) mechanism. It lets two spaces interact: the modulated velocity feature space from the previous step and the spatial feature space of the original patchified image:

𝑣_{𝑡_{1} \to 𝑡_{2}}^{*} = CA (modulate (𝑣_{𝑡_{1}}^{*}), 𝑥_{𝑡_{1}})

DeepFlow combines deep supervision with the VeRA block, giving the total loss:

ℒ_{total} = ℒ_{deep} + 𝜆 ℒ_{acc}

5. Decoupled Diffusion Transformer (DDT)

The authors of "Decoupled Diffusion Transformer" (DDT) identify a fundamental "optimization dilemma" in existing diffusion transformers. In conventional architectures, the same neural network modules must simultaneously handle two conflicting tasks: encoding low-frequency semantic information from noisy inputs and decoding high-frequency details for reconstruction. This dual responsibility creates optimization tension because effective semantic encoding requires suppressing high-frequency noise, while detail reconstruction demands preserving and generating high-frequency components.

DDT resolves this optimization conflict through a principled decoupled encoder-decoder architecture that explicitly separates semantic extraction from detail reconstruction:

5.1. Condition Encoder

The condition encoder specializes in extracting low-frequency semantic components from three inputs: the noisy latent image $𝑥_{𝑡}$ , timestep $𝑡$ , and class label $𝑦$ . It employs interleaved attention and feed-forward blocks similar to DiT/SiT, with timestep and class conditioning via AdaLN-Zero injection.

Crucially, the encoder receives direct supervision through Representation Alignment (REPAlign), which enforces consistency between extracted features and pre-trained vision foundation model representations (DINOv2).

5.2. Velocity Decoder

The velocity decoder focuses on processing the noisy latent $𝑥_{𝑡}$ alongside the self-condition feature $𝑧_{𝑡}$ to predict the velocity field $𝑣_{𝑡}$ . Operating within a linear flow diffusion framework, it minimizes the flow matching loss.

5.3. Inference Acceleration Through Encoder Sharing

A key innovation of DDT is leveraging the consistency of the extracted semantic features $𝑧_{𝑡}$ across adjacent timesteps for inference acceleration. The authors propose two strategies:

- Uniform Encoder Sharing: Recomputing $𝑧_{𝑡}$ at fixed intervals (every K steps) rather than at every denoising step.
- Statistical Dynamic Programming: A more sophisticated approach that frames optimal sharing as a minimal-cost path problem. Using a pre-computed similarity matrix $𝑆$ of cosine distances between $𝑧_{𝑡}$ features across timesteps, dynamic programming finds the optimal recalculation schedule that maximizes sharing while minimizing performance degradation.
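The two strategies can be sketched in plain Python. The cost model below is an assumption made for illustration: reusing the $𝑧$ computed at step s for a later step j costs `dist[s][j]`, and the total cost is the sum over all steps. The paper's exact formulation may differ, and the function names are hypothetical.

```python
def uniform_schedule(num_steps, interval):
    """Steps at which z_t is recomputed when sharing every `interval` steps."""
    return list(range(0, num_steps, interval))

def dp_schedule(dist, num_recompute):
    """Pick `num_recompute` recomputation steps minimizing total reuse cost.

    dist[s][j] is the (assumed) cost of reusing the z computed at step s
    for step j, e.g. a cosine distance. Step 0 is always recomputed.
    Requires 1 <= num_recompute <= len(dist).
    """
    n = len(dist)
    inf = float("inf")

    def segment_cost(start, end):
        # Cost of reusing z from step `start` for steps start..end-1.
        return sum(dist[start][j] for j in range(start, end))

    # best[k][e]: minimal cost of covering steps 0..e-1 with k segments.
    best = [[inf] * (n + 1) for _ in range(num_recompute + 1)]
    prev = [[-1] * (n + 1) for _ in range(num_recompute + 1)]
    best[0][0] = 0.0
    for k in range(1, num_recompute + 1):
        for end in range(1, n + 1):
            for start in range(end):
                cand = best[k - 1][start] + segment_cost(start, end)
                if cand < best[k][end]:
                    best[k][end] = cand
                    prev[k][end] = start

    # Backtrack the segment starts, which are the recomputation steps.
    starts, end = [], n
    for k in range(num_recompute, 0, -1):
        end = prev[k][end]
        starts.append(end)
    return starts[::-1]

# Toy cost: features drift linearly with the step gap.
toy_dist = [[abs(i - j) for j in range(4)] for i in range(4)]
uniform = uniform_schedule(6, 3)
optimal = dp_schedule(toy_dist, 2)
```

On the toy matrix the dynamic program splits the four steps into two equal segments, which matches the uniform schedule. The two differ once the feature drift is uneven across timesteps.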

6. Representation Entanglement for Generation (REG)

REG introduces the class token ${cls}_{0} = 𝑓_{0} [0]$ from the vision foundation model and entangles it with the image latents to provide discriminative guidance. The class token is noised alongside the latents: ${cls}_{𝑡} = 𝛼_{𝑡} {cls}_{0} + 𝜎_{𝑡} 𝜀_{cls}$ .

The prediction loss is formulated as:

ℒ_{pred} = \int 𝐸 [{‖ 𝑣 (𝑧_{𝑡}, 𝑡) - {\dot{𝛼}}_{𝑡} 𝑧_{0} - {\dot{𝜎}}_{𝑡} 𝜀_{𝑧} ‖}^{2} + 𝛽 {‖ 𝑣 ({cls}_{𝑡}, 𝑡) - {\dot{𝛼}}_{𝑡} {cls}_{0} - {\dot{𝜎}}_{𝑡} 𝜀_{cls} ‖}^{2}] 𝑑 𝑡

The total loss is formulated as:

ℒ_{total} = ℒ_{pred} + 𝜆 ℒ_{REPA}

7. ReDi

Rather than aligning diffusion features with external representations via distillation, we propose to jointly model both images (specifically their VAE latents) and their high-level semantic features extracted from a pretrained vision encoder (e.g., DINOv2) within the same diffusion process.

Figure 10: Given an input image, the VAE latent and the principal components of DINOv2 are extracted. Then both modalities are noised and fused into a joint token sequence which is given as input to DiT or SiT.

7.1. Representation Guidance

To ensure the generated images remain strongly influenced by the visual representations during inference, we introduce Representation Guidance. This technique during inference modifies the posterior distribution to: ${\hat{𝑝}}_{𝜃} (𝑥_{𝑡}, 𝑧_{𝑡}) \propto 𝑝_{𝜃} (𝑥_{𝑡}) 𝑝_{𝜃} {(𝑧_{𝑡} | 𝑥_{𝑡})}^{𝑤_{𝑟}}$ , where $𝑤_{𝑟}$ controls how strongly samples are pushed toward higher likelihoods of the conditional distribution $𝑝_{𝜃} (𝑧_{𝑡} | 𝑥_{𝑡})$ . This yields the guided score function:

\nabla_{𝑥_{𝑡}} \log {\hat{𝑝}}_{𝜃} (𝑥_{𝑡}, 𝑧_{𝑡}) = \nabla_{𝑥_{𝑡}} \log 𝑝_{𝜃} (𝑥_{𝑡}) + 𝑤_{𝑟} (\nabla_{𝑥_{𝑡}} \log 𝑝_{𝜃} (𝑧_{𝑡} | 𝑥_{𝑡}))

= \nabla_{𝑥_{𝑡}} \log 𝑝_{𝜃} (𝑥_{𝑡}) + 𝑤_{𝑟} (\nabla_{𝑥_{𝑡}} \log 𝑝_{𝜃} (𝑥_{𝑡}, 𝑧_{𝑡}) - \nabla_{𝑥_{𝑡}} \log 𝑝_{𝜃} (𝑥_{𝑡}))

By recalling the equivalence of denoisers and scores, we implement this representation-guided prediction ${\hat{𝜀}}_{𝜃} (𝑥_{𝑡}, 𝑧_{𝑡}, 𝑡)$ at each denoising step as follows:

{\hat{𝜀}}_{𝜃} (𝑥_{𝑡}, 𝑧_{𝑡}, 𝑡) = 𝜀_{𝜃} (𝑥_{𝑡}, 𝑡) + 𝑤_{𝑟} (𝜀_{𝜃} (𝑥_{𝑡}, 𝑧_{𝑡}, 𝑡) - 𝜀_{𝜃} (𝑥_{𝑡}, 𝑡))
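The guided prediction is a linear extrapolation between two model outputs, analogous in form to classifier-free guidance. A minimal NumPy sketch, with made-up arrays standing in for the two noise predictions:

```python
import numpy as np

def representation_guided_eps(eps_uncond, eps_joint, w_r):
    """Combine the two noise predictions with guidance weight w_r."""
    return eps_uncond + w_r * (eps_joint - eps_uncond)

eps_x = np.array([0.0, 1.0])   # stand-in for eps_theta(x_t, t)
eps_xz = np.array([1.0, 3.0])  # stand-in for eps_theta(x_t, z_t, t)
guided = representation_guided_eps(eps_x, eps_xz, w_r=1.5)
```

Setting `w_r = 1` recovers the joint prediction unchanged, `w_r = 0` ignores the representation, and values above 1 push further in the direction the representation suggests.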

8. REPA-E

Figure 11: Can we unlock VAE for end-to-end tuning with latent-diffusion models?

Higher representation-alignment score correlates with improved generation performance. This offers an alternate path for improving final generation performance using representation-alignment score as a proxy.

The maximum achievable alignment score with vanilla-REPA is bottlenecked by the VAE latent space features.

8.1. End-to-End Training with REPA

Batch-Norm Layer for VAE Latent Normalization. To enable end-to-end training, we first introduce a batchnorm layer between the VAE and latent diffusion model. Typical LDM training involves normalizing the VAE features using precomputed latent statistics (e.g., std $= \frac{1}{0.1825}$ for SD-VAE). This helps normalize the VAE latent outputs to zero mean and unit variance for more efficient training for the diffusion model. However, with end-to-end training the statistics need to be recomputed whenever the VAE model is updated - which is expensive. To address this, we propose the use of a batchnorm layer which uses the exponential moving average (EMA) mean and variance as a surrogate for dataset-level statistics. The batch-norm layer thus acts as a differentiable normalization operator without the need for recomputing dataset level statistics after each optimization step.
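A rough NumPy sketch of such a normalization layer is shown below. It assumes per-channel statistics over (batch, height, width) and no learnable affine parameters; the class name, momentum, and epsilon are illustrative choices rather than details from the paper.

```python
import numpy as np

class LatentBatchNorm:
    """Per-channel normalization with EMA running statistics (no affine)."""

    def __init__(self, num_channels, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = np.zeros(num_channels)
        self.running_var = np.ones(num_channels)

    def __call__(self, z, training=True):
        # z: (batch, channels, height, width) VAE latents
        if training:
            mean = z.mean(axis=(0, 2, 3))
            var = z.var(axis=(0, 2, 3))
            m = self.momentum
            # EMA statistics act as a surrogate for dataset-level statistics.
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        shape = (1, -1, 1, 1)
        return (z - mean.reshape(shape)) / np.sqrt(var.reshape(shape) + self.eps)

rng = np.random.default_rng(0)
latents = 5.0 + 3.0 * rng.normal(size=(8, 4, 16, 16))
bn = LatentBatchNorm(num_channels=4)
normalized = bn(latents, training=True)
```

During training the batch statistics normalize the latents directly, while the running averages track the slowly changing VAE output distribution and can be used at inference time without a pass over the dataset.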

End-to-End Representation-Alignment Loss. We next enable end-to-end training, by using the REPA loss for updating the parameters for both VAE and LDM during training.

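Diffusion Loss with Stop-Gradient. Backpropagating the diffusion loss to the VAE degrades the structure of the latent space. A stop-gradient is therefore applied so that the diffusion loss updates only the latent diffusion model parameters $𝜃$ .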

VAE Regularization Losses. Finally, we introduce regularization losses $ℒ_{reg}$ for VAE $𝒱_{𝜑}$ , to ensure that the end-to-end training process does not impact the reconstruction performance (rFID) of the original VAE.

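Overall Training. The overall training is then performed end-to-end using the following loss, where $𝜃$ , $𝜑$ , and $𝜔$ denote the parameters of the LDM, the VAE, and the trainable REPA projection layer, respectively: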

ℒ (𝜃, 𝜑, 𝜔) = ℒ_{DIFF} (𝜃) + 𝜆 ℒ_{REPA} (𝜃, 𝜑, 𝜔) + 𝜂 ℒ_{REG} (𝜑)

9. U-REPA

Adapting REPA to U-Net architectures.

10. SoftREPA

Figure 13: (a) Learnable soft tokens of each layer are prepended to the text features across the upper layers. (b) The soft tokens are optimized to contrastively match the score with positively conditioned predicted noise while repelling the score from negatively conditioned predicted noise.

While modern text-to-image (T2I) generative models have achieved remarkable success, a persistent challenge is the occasional misalignment between the generated image and the input text prompt. This can manifest as incorrect objects, attributes, or compositions. Existing approaches to fix this often require extensive fine-tuning or specialized preference datasets, which can be computationally expensive.

10.1. The SoftREPA Method

SoftREPA enhances text-image alignment by leveraging contrastive learning with a small set of trainable parameters, known as "soft tokens," while keeping the large pre-trained model frozen.

10.1.1. Learnable Soft Tokens

Soft tokens are learnable vectors that do not correspond to any specific words in a vocabulary. During the generation process, these tokens are prepended to the original text embeddings at various layers of the model's denoiser.

Because the main model remains frozen, only these soft tokens—fewer than 1 million parameters in total—are optimized during training. They act as adaptable guides, steering the model's internal representations toward better semantic alignment with the text prompt without requiring a full model retrain.

10.1.2. Contrastive T2I Alignment Loss

SoftREPA is trained using a contrastive framework that teaches the model to distinguish between correctly paired images and text (positive pairs) and mismatched pairs (negative pairs).

The "similarity" between an image $𝑥$ and a text $𝑦$ is defined based on the model's denoising performance. Intuitively, if the pair is a good match, the model $𝑣_{𝜃}$ should accurately predict its training target, which is determined by the noise vector $𝜀$ that was added to the image. The similarity logit $\tilde{𝑙}$ is formulated as an exponential of the negative denoising error:

\tilde{𝑙} (𝑥, 𝑦, 𝑠) = \exp (- \frac{{‖ 𝑣_{𝜃} (𝑥_{𝑡}, 𝑡, 𝑦, 𝑠) - (𝜀 - 𝑥_{0}) ‖}^{2}}{𝜏 (𝑡)})

Here, a smaller denoising error results in a higher similarity score (closer to 1). The final SoftREPA loss is a contrastive objective that maximizes the similarity of positive pairs while minimizing it for negative pairs. Given a positive pair $(𝑥, 𝑦)$ and a set of negative texts $𝑦^{𝑗}$ , the loss for the learnable tokens $𝑠$ is:

ℒ_{SoftREPA} (𝑠) = - 𝐸 [\log (\frac{\exp (\tilde{𝑙} (𝑥, 𝑦, 𝑠))}{\sum_{𝑗} \exp (\tilde{𝑙} (𝑥, 𝑦^{𝑗}, 𝑠))})]

This objective function effectively pushes the model's predicted noise for a positive pair closer to the ground truth while pushing the predictions for negative pairs further away.
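A small NumPy sketch of the two formulas may help. It assumes the denominator runs over all candidate texts including the positive one (the usual InfoNCE convention) and uses random stand-ins for model outputs; the function names are illustrative.

```python
import numpy as np

def similarity_logits(pred, target, tau):
    """exp(-||v - target||^2 / tau) for each candidate text.

    pred:   (num_texts, D) model outputs, one row per candidate text
    target: (D,) regression target, eps - x_0 in the formula above
    """
    err = np.sum((pred - target) ** 2, axis=-1)
    return np.exp(-err / tau)

def softrepa_loss(logits, positive_index=0):
    """Cross-entropy of the positive text against all candidate texts."""
    shifted = logits - np.max(logits)  # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -float(log_probs[positive_index])

target = np.zeros(4)
preds = np.stack([np.zeros(4), np.ones(4), np.ones(4)])  # row 0 fits best
logits = similarity_logits(preds, target, tau=1.0)
loss_matched = softrepa_loss(logits, positive_index=0)
loss_mismatched = softrepa_loss(logits, positive_index=1)
```

The loss is lower when the positive text is the one with the smallest denoising error, which is the signal used to train the soft tokens.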

11. Representation Autoencoders (RAE)

Figure 14: The VAE relies on convolutional backbones with aggressive down- and up-sampling, while the RAE uses a ViT architecture without compression.

Diffusion transformers (DiTs) have emerged as powerful models for generative tasks, typically using a variational autoencoder (VAE) to compress input images into a low-dimensional latent space for the diffusion process. However, most existing DiTs still depend on traditional VAE encoders that:

- Use outdated architectures
- Restrict information capacity with low-dimensional latents
- Rely on loss functions optimized only for reconstruction, limiting generative quality

11.1. High Fidelity Reconstruction From Frozen Encoders

Figure 15: RAEs consistently outperform SD-VAE in reconstruction (rFID) and representation quality (linear probing accuracy) on ImageNet-1K, while being more efficient. If not specified, we use ViTXL as the decoder and DINOv2-B as the encoder for RAE. Default settings in this paper are in gray.

DiT does not work out of the box. To our surprise, the standard diffusion recipe fails with RAE. Training directly on RAE latents causes a small backbone such as DiT-S to fail completely, while a larger backbone like DiT-XL significantly underperforms its counterpart trained on SD-VAE latents. To investigate this observation, we raise several hypotheses, each discussed below together with the resulting fix:

Figure 16: Overfitting to a single sample. Left: increasing model width leads to lower loss and better sample quality; Right: changing model depth has a marginal effect on overfitting results.

Suboptimal design for diffusion transformers. When modeling high-dimensional RAE tokens, the optimal design choices for diffusion transformers can diverge from those of the standard DiT, which was originally tailored for low-dimensional VAE tokens.

$\to$ Suboptimal design for diffusion transformers. We now fix the width of DiT to be at least as large as the RAE token dimension. For RAE with the DINOv2-B encoder, we pair it with DiT-XL in our following experiments.

Suboptimal noise scheduling. Prior noise scheduling and loss re-weighting tricks are derived for image-based or VAE-based input, and it remains unclear whether they transfer well to high-dimensional semantic tokens.

$\to$ Suboptimal noise scheduling. We now default the noise schedule to be dependent on the effective data dimension for all our following experiments.

Diffusion generates noisy latents. VAE decoders are trained to reconstruct images from noisy latents, making them more tolerant to small noises in diffusion outputs. In contrast, RAE decoders are trained on only clean latents and may therefore struggle to generalize.

$\to$ Diffusion generates noisy latents. We now adopt the noise-augmented decoding for all our following experiments.

12. Self-supervised representations for Visual Generation (SVG)

The authors introduce SVG—a latent diffusion generator without a VAE. Instead, SVG uses self-supervised DINO features, which retain strong semantic discriminability. The framework consists of a frozen DINO backbone for feature extraction and a residual branch for fine visual details. The diffusion process takes place directly in this semantic space.

13. End-to-end Pixel-space Generative model (EPG)

Most high-resolution image generative models (e.g., diffusion, consistency models) rely on VAE-based latent representations to enable efficient training and high quality. Pure pixel-space generative modeling (operating directly on raw pixels, without a latent bottleneck) is much more challenging due to higher variance and complexity, and thus usually underperforms latent-space approaches.

This paper proposes a two-stage framework to significantly improve pixel-space generative models:

Self-supervised Pre-training:
- An encoder is trained via contrastive and representation consistency losses, ensuring that features are robust to noise and sampling trajectory. This step aligns the semantic representations between noisy inputs and their clean counterparts in pixel space.
End-to-End Fine-tuning:
- The projection head is removed, and the pre-trained encoder is paired with a randomly initialized decoder to jointly optimize for image generation via diffusion or consistency objectives.
- Proposed adaptive temperature scheduling stabilizes contrastive training and prevents early-stage training collapse.

14. Discriminative Generative Image Transformer (DiGIT)

This paper introduces a new perspective on image generative modeling, highlighting that an optimal latent space for autoregressive generation should not be solely optimized for pixel reconstruction, but should also emphasize stability—meaning robustness to perturbations and resistance to error accumulation. The authors observe that autoregressive models perform worse than diffusion or iterative models (despite sharing the same latent spaces induced by standard autoencoders like VQGAN) because their generation process amplifies instability in the latent space.

To address this, the paper proposes a simple but effective method:

- The encoder and decoder are trained separately. The encoder is realized as a discriminative self-supervised model (such as DINOv2), which extracts robust, semantically meaningful features from image patches without training for reconstruction.
- These patch-level features are then discretized into tokens via K-Means clustering, forming a discrete and more stable latent codebook (tokenizer).
- An autoregressive Transformer is trained to generate these discrete tokens, following the standard causal next-token prediction setup (like GPT). The pixel decoder is trained separately to reconstruct images from the sequence of tokens.
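The tokenization step above can be illustrated with a small NumPy sketch: fit K-Means centroids on patch features, then map each patch to the index of its nearest centroid. The synthetic blobs stand in for real encoder features, and the plain Lloyd's loop is a simplification; the codebook size and clustering setup used in the paper are not specified here.

```python
import numpy as np

def kmeans(features, num_tokens, num_iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (num_tokens, D) centroids."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(features), size=num_tokens, replace=False)
    centroids = features[init].copy()
    for _ in range(num_iters):
        ids = tokenize(features, centroids)
        for k in range(num_tokens):
            members = features[ids == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids

def tokenize(features, centroids):
    """Map each (D,) patch feature to the index of its nearest centroid."""
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Two well-separated blobs standing in for encoder patch features.
rng = np.random.default_rng(1)
feats = np.concatenate([rng.normal(0.0, 0.1, size=(50, 8)),
                        rng.normal(5.0, 0.1, size=(50, 8))])
codebook = kmeans(feats, num_tokens=2)
tokens = tokenize(feats, codebook)
```

The resulting integer sequence is what an autoregressive Transformer would be trained on with next-token prediction. Small perturbations of a feature leave its token unchanged unless they cross a cluster boundary, which is one way to picture the stability argument.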

Key findings:

- The discriminatively-induced latent space is much more stable under input noise and less sensitive to token prediction errors, making it more suitable for autoregressive image generation.
- The resulting model, called DiGIT, achieves state-of-the-art results on image understanding and image generation tasks. It even surpasses diffusion models and scales well with increased model size (mirroring the scaling success of GPT in text).
- Ablation studies show that a larger token vocabulary improves performance; the stability of the latent space is empirically and theoretically analyzed, drawing parallels between self-supervised discriminative encoders (LDA) and reconstructive autoencoders (PCA).
- This approach challenges the conventional view that reconstruction-optimized latent spaces are ideal for generative modeling, advocating for stability-targeted designs instead.

15. UniFlow

Figure: Comparison of different training paradigms for unified tokenizers.

UniFlow proposes a new unified vision tokenizer designed for both visual understanding and generation tasks, overcoming trade-offs faced by existing approaches.

Method Design: UniFlow flexibly adapts pretrained vision encoders and introduces a lightweight pixel flow decoder for high-fidelity pixel reconstruction. Its core innovation is a hierarchical adaptive self-distillation mechanism, letting the unified encoder inherit strong semantic capability while preserving fine visual details and improving training efficiency.
