Vision Foundation Model Aligned Variational AutoEncoder

July 4, 2025

by Leonardo

1. Vision Foundation Model Aligned Variational AutoEncoder (VA-VAE)

As visual tokenizers become more sophisticated with higher-dimensional latent spaces to improve reconstruction quality, they paradoxically become harder for diffusion models to work with, leading to poor generation performance.

Figure 1: Optimization dilemma within latent diffusion models. In latent diffusion models, increasing the dimension of the visual tokenizer enhances detail reconstruction but significantly reduces generation quality.

To resolve this dilemma, the research introduces a Vision Foundation Model Aligned Variational AutoEncoder (VA-VAE) that uses a novel Vision Foundation Model Alignment Loss (VF Loss). This approach leverages the structured, semantically meaningful representations learned by pre-trained vision foundation models to guide the tokenizer's latent space toward being more "generation-friendly."

Figure 2: Vision foundation models are used to guide the training of high-dimensional visual tokenizers, effectively mitigating the optimization dilemma and improving generation performance.

Relationship to REPA: REPA uses vision foundation models to constrain the DiT, which speeds up the convergence of the generative model. In contrast, VA-VAE considers both the reconstruction and the generative capabilities of the latent diffusion model. It uses foundation models to regularize the high-dimensional latent space of the tokenizer, thereby resolving the optimization conflict between the tokenizer and the generative model.

1.1. Align VAE with Vision Foundation Models

The Vision Foundation Model Alignment Loss (VF Loss) consists of two components: a marginal cosine similarity loss and a marginal distance matrix similarity loss.

1.1.1. Marginal Cosine Similarity Loss

We project the image latents $𝑍$ to match the dimensionality of the foundation model's visual representations $𝐹$ using a linear transformation $𝑊$, producing $𝑍^{'} = 𝑊 𝑍$.

The Marginal Cosine Similarity Loss enforces element-wise alignment between the VAE's latent features and foundation model features, focusing alignment on less similar pairs:

ℒ_{mcos} = \frac{1}{ℎ \times 𝑤} \sum_{𝑖 = 1}^{ℎ} \sum_{𝑗 = 1}^{𝑤} \max (0, 𝑚_{1} - \cos (𝑧_{𝑖 𝑗}^{'}, 𝑓_{𝑖 𝑗}))
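
As an illustration, the loss above can be sketched in plain Python. This is a minimal sketch rather than the authors' implementation: the function names (`cosine`, `project`, `marginal_cosine_loss`), the nested-list layout of the feature maps, and the small `eps` guard against zero-length vectors are all assumptions made for the example.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (eps avoids 0/0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def project(W, z):
    """Apply the linear map W (one row per output dimension) to a latent z."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def marginal_cosine_loss(Z, F, W, m1):
    """Mean over all h x w positions of max(0, m1 - cos(W z_ij, f_ij)).

    Z: h x w grid of latent vectors (hypothetical nested-list layout).
    F: h x w grid of foundation-model feature vectors.
    """
    h, w = len(Z), len(Z[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            z_proj = project(W, Z[i][j])
            # Positions whose similarity already exceeds m1 contribute nothing.
            total += max(0.0, m1 - cosine(z_proj, F[i][j]))
    return total / (h * w)

# Toy 1 x 2 feature map: 2-dim latents projected to 3-dim features.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
Z = [[[1.0, 0.0], [0.0, 1.0]]]
F = [[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
loss = marginal_cosine_loss(Z, F, W, m1=0.5)
```

In this toy case the first position is perfectly aligned and contributes zero, while the second has cosine similarity 0 and contributes the full margin, so only the poorly aligned pair drives the loss.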

1.1.2. Marginal Distance Matrix Similarity Loss

Complementary to $ℒ_{mcos}$, which enforces point-to-point absolute alignment, we also want the pairwise similarity matrices of the two feature maps to be as close as possible. The Marginal Distance Matrix Similarity Loss therefore aligns the internal structure and relationships within the latent space:

ℒ_{mdms} = \frac{1}{𝑁^{2}} \sum_{𝑖, 𝑗} \max (0, | \cos (𝑧_{𝑖}^{'}, 𝑧_{𝑗}^{'}) - \cos (𝑓_{𝑖}, 𝑓_{𝑗}) | - 𝑚_{2})

Here, $𝑁 = ℎ \times 𝑤$ represents the total number of elements in each flattened feature map.
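
A minimal plain-Python sketch of this pairwise loss follows, assuming the feature maps have already been flattened into lists of $𝑁$ vectors and that the margin is subtracted from the absolute difference of the two similarities. The names `cosine` and `marginal_distance_matrix_loss` are hypothetical, and the quadratic double loop is meant for clarity, not efficiency.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (eps avoids 0/0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def marginal_distance_matrix_loss(Z_proj, F, m2):
    """Mean over all N^2 pairs of max(0, |cos(z_i, z_j) - cos(f_i, f_j)| - m2).

    Z_proj: list of N projected latent vectors (flattened feature map).
    F:      list of N foundation-model feature vectors.
    """
    n = len(Z_proj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim_z = cosine(Z_proj[i], Z_proj[j])
            sim_f = cosine(F[i], F[j])
            # Differences smaller than the margin m2 are not penalized.
            total += max(0.0, abs(sim_z - sim_f) - m2)
    return total / (n * n)

# Toy example with N = 2: the latents are orthogonal, the features identical.
Z_proj = [[1.0, 0.0], [0.0, 1.0]]
F = [[1.0, 0.0], [1.0, 0.0]]
loss = marginal_distance_matrix_loss(Z_proj, F, m2=0.25)
```

Because the loss compares similarities within each feature map rather than across them, it depends only on the relative geometry of the two maps. Passing the same list for both arguments gives a loss of zero.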

1.1.3. Adaptive Weighting

ℒ_{vf} = 𝑤_{hyper} \cdot 𝑤_{adaptive} (ℒ_{mcos} + ℒ_{mdms})

The adaptive weighting function $𝑤_{adaptive}$ is defined as $\frac{‖ \nabla ℒ_{rec} ‖}{‖ \nabla ℒ_{vf} ‖}$ to ensure $ℒ_{vf}$ and $ℒ_{rec}$ have similar impacts on model optimization.
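
The weighting can be sketched as follows. This is a sketch under stated assumptions, not the authors' code: both gradients are assumed to be flattened into plain lists and taken with respect to the same set of parameters, the VF gradient is assumed to be that of the unweighted sum $ℒ_{mcos} + ℒ_{mdms}$, and the `eps` term guarding against division by zero is an addition for the example. All function names are hypothetical.

```python
import math

def l2_norm(grad):
    """Euclidean norm of a flattened gradient."""
    return math.sqrt(sum(g * g for g in grad))

def adaptive_weight(grad_rec, grad_vf, eps=1e-4):
    """Ratio ||grad L_rec|| / ||grad L_vf||; eps (an assumption) avoids division by zero."""
    return l2_norm(grad_rec) / (l2_norm(grad_vf) + eps)

def vf_loss(l_mcos, l_mdms, grad_rec, grad_vf, w_hyper, eps=1e-4):
    """Combine the two alignment terms: w_hyper * w_adaptive * (L_mcos + L_mdms)."""
    w_adaptive = adaptive_weight(grad_rec, grad_vf, eps)
    return w_hyper * w_adaptive * (l_mcos + l_mdms)

# Toy gradients: the reconstruction gradient is five times larger in norm,
# so the alignment loss is scaled up by roughly that factor.
grad_rec = [3.0, 4.0]   # norm 5
grad_vf = [0.6, 0.8]    # norm 1
w = adaptive_weight(grad_rec, grad_vf, eps=0.0)
total = vf_loss(0.25, 0.375, grad_rec, grad_vf, w_hyper=0.1, eps=0.0)
```

In an automatic-differentiation framework the weight would normally be treated as a constant, so that no gradient flows through the ratio itself.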

References

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
