Scale Equivariance of Diffusion Models

1. Improving Diffusion Models via Equivariance Regularization

While latent diffusion models (LDMs) have achieved tremendous success in image synthesis, their generation process is inherently unstable: even small perturbations or shifts in the input noise can lead to significantly different outputs. This instability hinders their use in applications requiring high consistency, such as video editing and image-to-image translation.

The studies discussed below trace the problem to undesirable properties of the latent space produced by the standard autoencoder (VAE). Their common remedy is to regularize VAE training with an additional equivariance loss.

1.1. Approach 1: Scale Equivariance

Two concurrent works, "EQ-VAE" and "Improving the Diffusability of Autoencoders", address the scale equivariance of the VAE.

1.1.1. Problem Diagnosis

  • Lack of Equivariance: Existing autoencoders lack equivariance to transformations like scaling, which introduces unnecessary complexity into the latent manifold.
  • Poor Spectral Properties: The latent spaces of modern autoencoders contain excessive high-frequency components. This deviation from the natural spectral distribution of RGB signals becomes more pronounced as the number of channels in the VAE's bottleneck increases.
  • Interference with Diffusion: This "flat" spectral distribution, with its strong high-frequency components, interferes with the natural "coarse-to-fine" synthesis process of diffusion models, hindering generation quality.
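
One way to make the "flat spectrum" diagnosis concrete is to measure how much of a latent channel's power sits at high spatial frequencies. The following is a minimal NumPy sketch, not the procedure used in the referenced works: the function name `high_freq_energy_fraction`, the cutoff of 0.25 cycles per sample, and the synthetic inputs are all assumptions made for illustration.

```python
import numpy as np

def high_freq_energy_fraction(z, cutoff=0.25):
    """Fraction of spectral power at radial frequencies above `cutoff`.

    `z` is a single 2D channel; `cutoff` is in cycles per sample
    (0.5 is the Nyquist frequency). Both the name and the cutoff
    are illustrative assumptions.
    """
    h, w = z.shape
    # Remove the mean so the DC term does not dominate the total power.
    power = np.abs(np.fft.fft2(z - z.mean())) ** 2
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[radius > cutoff].sum() / total)

rng = np.random.default_rng(0)

# White noise stands in for a latent with a flat spectrum.
noise = rng.standard_normal((64, 64))

# A slow horizontal sinusoid stands in for a smooth, low-frequency signal.
smooth = np.tile(np.sin(2 * np.pi * 2 * np.arange(64) / 64), (64, 1))

noise_fraction = high_freq_energy_fraction(noise)
smooth_fraction = high_freq_energy_fraction(smooth)
print(noise_fraction, smooth_fraction)
```

Under this measure, a flat spectrum places most of its power above the cutoff, while a smooth signal places almost none there.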

1.1.2. Solution: Downsampling Regularization

To solve this, these works propose an implicit regularization method using downsampling. The core idea is to force the decoder to reconstruct a downsampled image from a downsampled latent code, ensuring consistency across scales.

The scale equivariance loss can be summarized as:

$$\mathcal{L}_{\text{scale-equiv}} = \left\| \mathrm{down}(x) - D(\mathrm{down}(E(x))) \right\|^2$$

The key components in this formula are:

  • 𝑥: The original, high-resolution input image.
  • 𝐸(𝑥): The Encoder, which compresses the image 𝑥 into a latent code 𝑧.
  • 𝐷(𝑧): The Decoder, which reconstructs an image from the latent code 𝑧.
  • down(.): A downsampling function (e.g., bilinear interpolation) that reduces the spatial resolution of an image or tensor.
  • down(E(𝑥)): The key operation, where the image is first encoded into a latent representation, and then the latent representation itself is downsampled.
  • D(down(E(𝑥))): The decoder's attempt to reconstruct a low-resolution image from this downsampled latent code.
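
The components listed above can be put together in a short NumPy sketch. It is only an illustration under stated assumptions: `encode` and `decode` are hypothetical stand-ins for a trained VAE (4x average pooling and 4x nearest-neighbour upsampling), `down` uses 2x average pooling instead of bilinear interpolation, and the loss is averaged over pixels rather than summed.

```python
import numpy as np

def avg_pool(x, factor):
    """Downsample a 2D array by averaging non-overlapping factor x factor blocks."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample_nearest(z, factor):
    """Upsample a 2D array by repeating each entry `factor` times along both axes."""
    return np.repeat(np.repeat(z, factor, axis=0), factor, axis=1)

# Hypothetical stand-ins for a trained VAE (assumption: 4x spatial compression).
def encode(x):
    return avg_pool(x, 4)

def decode(z):
    return upsample_nearest(z, 4)

def down(t):
    # Assumption: 2x average pooling in place of bilinear interpolation.
    return avg_pool(t, 2)

def scale_equiv_loss(x):
    """Mean squared error between down(x) and D(down(E(x)))."""
    target = down(x)                  # the low-resolution image
    recon = decode(down(encode(x)))   # decode the downsampled latent code
    return float(np.mean((target - recon) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))

loss_random = scale_equiv_loss(x)
loss_constant = scale_equiv_loss(np.ones((32, 32)))
print(loss_random, loss_constant)
```

With these toy components, a constant image is reconstructed exactly at the lower scale, so its loss is zero. A noisy image is not, so its loss is positive. In actual training, this term would be added to the usual VAE objective and minimized with respect to the encoder and decoder parameters.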

1.2. Approach 2: Shift Equivariance

Another concurrent work (Alias-Free LDM) approached the problem from the perspective of "aliasing," focusing on improving the model's shift equivariance.

1.2.1. Problem Diagnosis

  • Aliasing Effects: Neural network operations such as upsampling, downsampling, and non-linearities can introduce aliasing because they are not properly band-limited. This is identified as a primary cause of the lack of shift equivariance.
  • Aliasing Amplification: This aliasing effect is amplified during the VAE training process and across the multiple iterative steps of U-Net denoising.
  • Attention Modules: The self-attention modules used in the U-Net are inherently sensitive to global translations and are not shift-equivariant.
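
The link between aliasing and shift sensitivity can be seen in a one-dimensional toy case. The sketch below is an illustration only and is not taken from the referenced work: it compares plain stride-2 subsampling with subsampling preceded by an ideal FFT low-pass filter, on a signal that alternates at the Nyquist frequency. The function names are assumptions.

```python
import numpy as np

def subsample(x, stride=2):
    """Naive downsampling: keep every `stride`-th sample, with no filtering."""
    return x[::stride]

def lowpass_then_subsample(x, stride=2):
    """Band-limit the signal to the new Nyquist frequency, then subsample."""
    spectrum = np.fft.fft(x)
    freqs = np.abs(np.fft.fftfreq(x.shape[0]))
    # Zero every frequency the lower sampling rate cannot represent.
    spectrum[freqs >= 0.5 / stride] = 0.0
    return np.fft.ifft(spectrum).real[::stride]

# A signal alternating 1, 0, 1, 0, ... sits exactly at the Nyquist frequency.
x = np.tile([1.0, 0.0], 8)
x_shifted = np.roll(x, 1)  # shift the input by a single sample

naive_gap = np.abs(subsample(x) - subsample(x_shifted)).max()
filtered_gap = np.abs(
    lowpass_then_subsample(x) - lowpass_then_subsample(x_shifted)
).max()
print(naive_gap, filtered_gap)
```

Without filtering, a one-sample shift of the input flips the output from all ones to all zeros. With the low-pass filter, both outputs are the same constant, so the shift no longer changes the result.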

1.2.2. Solution: Shift Regularization and Equivariant Attention

This work proposes a two-part solution.

  1. Shift Equivariance Loss: An equivariance loss was introduced to directly regularize the network's learning process for shift consistency.

    $$\mathcal{L}_{\text{shift-equiv}} = \left\| f(T_{\Delta}(x)) - T_{k\Delta}(f(x)) \right\|^2$$

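    Here, T_Δ denotes a shift by Δ pixels, f is the network being regularized, and k is the network's spatial scaling factor, so a shift of Δ at the input corresponds to a shift of kΔ at the output.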

  2. Equivariant Attention: To address the self-attention issue, the paper computes the Query from the shifted features while taking the Key and Value from a fixed reference frame. This formulation is identical to Cross-Frame Attention (CFA) used in video editing.
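
The shift equivariance loss from item 1 can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: shifts are integer and circular (`np.roll`), whereas the referenced work targets fractional shifts; the two toy networks `upsample_2x` and `position_dependent` are hypothetical; and the loss is averaged over pixels.

```python
import numpy as np

def shift(x, delta):
    """T_delta: circular shift of a 2D array by `delta` pixels along the width axis."""
    return np.roll(x, delta, axis=1)

def shift_equiv_loss(f, x, delta, k):
    """Mean squared error between f(T_delta(x)) and T_{k*delta}(f(x))."""
    lhs = f(shift(x, delta))       # shift first, then apply the network
    rhs = shift(f(x), k * delta)   # apply the network, then shift by k * delta
    return float(np.mean((lhs - rhs) ** 2))

def upsample_2x(x):
    # Toy network with scaling factor k = 2 (nearest-neighbour upsampling).
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def position_dependent(x):
    # Toy network that breaks equivariance: its output depends on absolute position.
    ramp = np.linspace(0.0, 1.0, x.shape[1])[None, :]
    return upsample_2x(x * ramp)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))

loss_equivariant = shift_equiv_loss(upsample_2x, x, delta=3, k=2)
loss_broken = shift_equiv_loss(position_dependent, x, delta=3, k=2)
print(loss_equivariant, loss_broken)
```

The upsampling network commutes with shifts up to the factor k = 2, so its loss is zero. The position-dependent network does not, and its loss is positive, which is the quantity the regularizer would push toward zero during training.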

With these modifications, the resulting Alias-Free LDM (AF-LDM) achieves strong shift-equivariance, producing far more consistent and stable results in applications like video editing than baseline models.

References

  1. EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
  2. Improving the Diffusability of Autoencoders
  3. Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space