Scale Equivariance of Diffusion Models
1. Improving Diffusion Models via Equivariance Regularization
While latent diffusion models (LDMs) have achieved tremendous success in image synthesis, their generation process is inherently unstable: even small perturbations or shifts in the input noise can lead to significantly different outputs. This instability hinders their use in applications requiring high consistency, such as video editing and image-to-image translation.
Several recent studies trace the core of the problem to undesirable properties of the latent space produced by the standard variational autoencoder (VAE). Their common solution is to regularize the VAE's training process by introducing a new equivariance loss.
1.1. Approach 1: Scale Equivariance
Two concurrent works (EQ-VAE and Improving the Diffusability of Autoencoders) focused on addressing the scale equivariance of the VAE.
1.1.1. Problem Diagnosis
- Lack of Equivariance: Existing autoencoders lack equivariance to transformations like scaling, which introduces unnecessary complexity into the latent manifold.
- Poor Spectral Properties: The latent spaces of modern autoencoders contain excessive high-frequency components. This deviation from the natural spectral distribution of RGB signals becomes more pronounced as the number of channels in the VAE's bottleneck increases.
- Interference with Diffusion: This "flat" spectral distribution, with its strong high-frequency components, interferes with the natural "coarse-to-fine" synthesis process of diffusion models, hindering generation quality.
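To make "strong high-frequency components" concrete, the following is a minimal 1D toy sketch, not the measurement used in the cited works. The helper names `dft_power` and `high_freq_fraction`, and the choice of "above half the Nyquist frequency" as the cutoff, are assumptions made for illustration only.

```python
import cmath
import math

def dft_power(signal):
    """Return the power |X[k]|^2 of each DFT bin of a real 1D signal."""
    n = len(signal)
    power = []
    for k in range(n):
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(coeff) ** 2)
    return power

def high_freq_fraction(signal):
    """Fraction of spectral energy above half the Nyquist frequency (assumed cutoff)."""
    n = len(signal)
    power = dft_power(signal)
    # Bins k and n-k carry the same frequency, so fold them together.
    high = sum(p for k, p in enumerate(power) if min(k, n - k) > n // 4)
    return high / sum(power)

n = 16
smooth = [math.cos(2 * math.pi * t / n) for t in range(n)]  # one slow oscillation
jagged = [(-1.0) ** t for t in range(n)]                    # flips sign every sample

frac_smooth = high_freq_fraction(smooth)
frac_jagged = high_freq_fraction(jagged)
print(f"smooth: {frac_smooth:.3f}, jagged: {frac_jagged:.3f}")
```

A signal whose energy sits near the top of the spectrum, like `jagged`, is the 1D analogue of the high-frequency-heavy latents described above; a real analysis would apply a 2D transform to actual latent codes.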
1.1.2. Solution: Downsampling Regularization
To solve this, these works propose an implicit regularization method using downsampling. The core idea is to force the decoder to reconstruct a downsampled image from a downsampled latent code, ensuring consistency across scales.
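The scale equivariance loss can be summarized as:
$$\mathcal{L}_{\text{scale}} = d\big(\mathrm{Down}(x),\; D(\mathrm{Down}(E(x)))\big)$$
where $d(\cdot,\cdot)$ is a reconstruction distance such as the L2 distance. The key components in this formula are:
- $x$: The original, high-resolution input image.
- $E$: The encoder, which compresses the image into a latent code $z = E(x)$.
- $D$: The decoder, which reconstructs an image from the latent code $z$.
- $\mathrm{Down}(\cdot)$: A downsampling function (e.g., bilinear interpolation) that reduces the spatial resolution of an image or tensor.
- $\mathrm{Down}(E(x))$: The key operation, where the image is first encoded into a latent representation, and then the latent representation itself is downsampled.
- $D(\mathrm{Down}(E(x)))$: The decoder's attempt to reconstruct a low-resolution image from this downsampled latent code, which is compared against the downsampled image $\mathrm{Down}(x)$.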
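The loss described above can be sketched with toy stand-ins for the trained networks. Everything here is an assumption for illustration: the "encoder" is 2x2 average pooling, the "decoder" is nearest-neighbour upsampling, the downsampling function is also 2x2 average pooling (rather than bilinear interpolation), and the distance is mean squared error. In real training the encoder and decoder would be learned networks and the loss would be backpropagated through both.

```python
def avg_pool2(img):
    """Downsample a 2D list by 2x using 2x2 average pooling."""
    h, w = len(img), len(img[0])
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def upsample2(img):
    """Upsample a 2D list by 2x using nearest-neighbour repetition."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def mse(a, b):
    """Mean squared error between two equally sized 2D lists."""
    diffs = [(p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def scale_equivariance_loss(x, encode, decode, down):
    """Compare the downsampled image with the decoding of the downsampled latent."""
    z = encode(x)                 # latent code
    target = down(x)              # low-resolution image
    recon = decode(down(z))       # decode the downsampled latent
    return mse(target, recon)

# Hypothetical toy autoencoder: 8x8 image -> 4x4 latent -> 8x8 image.
toy_encode, toy_decode = avg_pool2, upsample2

flat = [[0.5] * 8 for _ in range(8)]
stripes = [[1.0 if (j // 2) % 2 == 0 else 0.0 for j in range(8)] for _ in range(8)]

loss_flat = scale_equivariance_loss(flat, toy_encode, toy_decode, avg_pool2)
loss_stripes = scale_equivariance_loss(stripes, toy_encode, toy_decode, avg_pool2)
print(loss_flat, loss_stripes)
```

The flat image incurs no loss, while the striped image does: downsampling the toy latent erases detail that the low-resolution target still contains, which is the kind of inconsistency the regularizer penalizes.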
1.2. Approach 2: Shift Equivariance
Another concurrent work (Alias-Free LDM) approached the problem from the perspective of "aliasing," focusing on improving the model's shift equivariance.
1.2.1. Problem Diagnosis
- Aliasing Effects: Operations in neural networks like upsampling, downsampling, and non-linearities can introduce aliasing because they are not properly band-limited. This is identified as a primary cause for the lack of shift equivariance.
- Aliasing Amplification: This aliasing effect is amplified during the VAE training process and across the multiple iterative steps of U-Net denoising.
- Attention Modules: The self-attention modules used in the U-Net are inherently sensitive to global translations and are not shift-equivariant.
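The aliasing effect of an unfiltered downsampling step can be illustrated in one dimension. This is a toy sketch under stated assumptions: circular shifts, a stride of 2, and a simple [1/4, 1/2, 1/4] filter standing in for proper band-limiting. The function names are hypothetical.

```python
def circular_shift(signal, delta):
    """Shift a 1D list to the right by delta samples with wrap-around."""
    n = len(signal)
    return [signal[(i - delta) % n] for i in range(n)]

def subsample(signal):
    """Naive 2x downsampling: keep every second sample, no filtering."""
    return signal[::2]

def blur(signal):
    """Circular [1/4, 1/2, 1/4] low-pass filter, applied before subsampling."""
    n = len(signal)
    return [0.25 * signal[(i - 1) % n] + 0.5 * signal[i] + 0.25 * signal[(i + 1) % n]
            for i in range(n)]

x = [1.0, 0.0] * 4                 # highest-frequency signal on this grid
x_shifted = circular_shift(x, 1)   # one-sample shift of the input

naive_a, naive_b = subsample(x), subsample(x_shifted)
smooth_a, smooth_b = subsample(blur(x)), subsample(blur(x_shifted))

print("naive:   ", naive_a, naive_b)
print("filtered:", smooth_a, smooth_b)
```

Without filtering, a one-sample shift flips the downsampled output from all ones to all zeros. With the low-pass filter in front, both outputs agree, because the frequency that the coarser grid cannot represent has been removed first.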
1.2.2. Solution: Shift Regularization and Equivariant Attention
This work proposes a two-part solution:
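- Shift Equivariance Loss: An equivariance loss was introduced to directly regularize the network's learning process for shift consistency:
$$\mathcal{L}_{\text{shift}} = \big\lVert f(T_{\Delta}(x)) - T_{k\Delta}(f(x)) \big\rVert_2^2$$
Here, $f$ is the network being regularized, $T_{\Delta}$ represents a shift operation by $\Delta$ pixels, and $k$ is the network's scaling factor, so that a shift of $\Delta$ pixels at the input corresponds to a shift of $k\Delta$ pixels at the output.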
- Equivariant Attention: To address the self-attention issue, the paper computes the Query from the shifted features while fixing the Key and Value to a constant reference frame. This formulation is identical to Cross-Frame Attention (CFA) used in video editing.
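The shift equivariance loss compares shifting the input before the network with shifting the output after it. The following 1D sketch makes that concrete under simplifying assumptions: integer circular shifts, mean squared error, and two hypothetical toy "networks" (a 2x nearest-neighbour upsampler with scaling factor 2, and a map with a position-dependent bias and scaling factor 1).

```python
def circular_shift(signal, delta):
    """Shift a 1D list to the right by delta samples with wrap-around."""
    n = len(signal)
    return [signal[(i - delta) % n] for i in range(n)]

def shift_equivariance_loss(net, x, delta, scale):
    """Mean squared difference between net(shift(x)) and shift(net(x)).

    `scale` is the network's scaling factor: an input shift of `delta`
    is assumed to correspond to an output shift of `scale * delta`.
    """
    out_shift = round(scale * delta)
    shifted_then_net = net(circular_shift(x, delta))
    net_then_shifted = circular_shift(net(x), out_shift)
    diffs = [(a - b) ** 2 for a, b in zip(shifted_then_net, net_then_shifted)]
    return sum(diffs) / len(diffs)

def upsample2(signal):
    """Toy equivariant network: repeat every sample twice (scale = 2)."""
    return [v for v in signal for _ in range(2)]

def add_position_bias(signal):
    """Toy non-equivariant network: output depends on absolute position (scale = 1)."""
    return [v + 0.1 * i for i, v in enumerate(signal)]

x = [0.0, 1.0, 2.0, 3.0]
loss_equivariant = shift_equivariance_loss(upsample2, x, delta=1, scale=2)
loss_biased = shift_equivariance_loss(add_position_bias, x, delta=1, scale=1)
print(loss_equivariant, loss_biased)
```

The upsampler commutes with shifts and incurs zero loss, whereas the position-dependent map is penalized. During training, such a term would be added to the usual objective so that gradients push the network toward the first behaviour.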
With these modifications, the resulting Alias-Free LDM (AF-LDM) achieves strong shift-equivariance, producing far more consistent and stable results in applications like video editing than baseline models.
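The equivariant attention of Section 1.2.2 can be illustrated with a small sketch that compares ordinary self-attention with attention whose Key and Value come from a fixed reference frame. The assumptions are loud ones: token features are used directly as Query, Key and Value (no learned projections), the "frames" are short 1D token lists, and the shift is non-circular, so one new token enters and one leaves.

```python
import math

def attention(queries, keys, values):
    """Plain softmax attention over lists of feature vectors."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)                       # subtract max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                    for d in range(len(values[0]))])
    return out

def max_abs_diff(a, b):
    """Largest element-wise difference between two lists of vectors."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
# Shift right by one token: new content enters on the left, the last token drops out.
shifted = [[2.0, 2.0]] + reference[:-1]

ref_out = attention(reference, reference, reference)

# Self-attention: Key and Value change along with the shifted frame.
self_out = attention(shifted, shifted, shifted)
# Cross-frame attention: Query follows the shift, Key and Value stay on the reference.
cross_out = attention(shifted, reference, reference)

# Compare the tokens that appear in both frames (shifted[1:] equals reference[:-1]).
self_error = max_abs_diff(self_out[1:], ref_out[:-1])
cross_error = max_abs_diff(cross_out[1:], ref_out[:-1])
print(self_error, cross_error)
```

With the Key and Value fixed, each token's output depends only on its own Query, so the outputs of the shared tokens simply move with the shift. In self-attention the changed Key and Value set alters every output, even for content that is identical in both frames.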