Generative Model

We want to sample from $p(x)$, but how? There are four common approaches:

| Paradigm | Idea |
| --- | --- |
| Directly model $p(x)$ | Explicit density: $p(x) = \dfrac{q(x)}{Z}$, $Z = \int q(x)\,dx$. Or tractable forms: $p(x) = \prod_i p(x_i \mid x_{<i})$ |
| Latent variable $p(x \mid z)$ | Marginalization: $p(x) = \int p(x \mid z)\,p(z)\,dz$. Variational bound: $\log p(x) \ge \mathrm{ELBO}$ |
| Implicit generation $G(z) \to x$ | Pushforward measure: $p_g(x) = (G_\theta)_{\#}\,p(z)$. No explicit density |
| Score-based $\nabla_x \log p(x)$ | Score matching: learn $s_\theta(x) \approx \nabla_x \log p(x)$. Reverse diffusion process |
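
To make the four paradigms concrete, here is a minimal NumPy sketch on a toy 1-D Gaussian. The target parameters `MU` and `SIGMA`, the latent model `x | z ~ N(z, 1)`, and all function names are assumptions chosen for illustration; they do not come from any particular model or library.

```python
import numpy as np

MU, SIGMA = 1.0, 0.5          # toy target N(MU, SIGMA^2); assumed values
rng = np.random.default_rng(0)

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# 1) Explicit density: p(x) = q(x) / Z with Z = ∫ q(x) dx.
def q(x):
    return np.exp(-0.5 * ((x - MU) / SIGMA) ** 2)   # unnormalized density

grid = np.linspace(MU - 10 * SIGMA, MU + 10 * SIGMA, 20_001)
Z = np.sum(q(grid)) * (grid[1] - grid[0])           # Riemann-sum estimate of Z

def p(x):
    return q(x) / Z

# 2) Latent variable: p(x) = ∫ p(x|z) p(z) dz, estimated by Monte Carlo.
#    Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1), so p(x) = N(0, 2).
z_prior = rng.standard_normal(200_000)

def p_latent(x):
    return normal_pdf(x, z_prior, 1.0).mean()

# 3) Implicit generation: push noise through G; sampling never evaluates a density.
def G(z):
    return MU + SIGMA * z

samples = G(rng.standard_normal(100_000))

# 4) Score: for the Gaussian target, ∇_x log p(x) = -(x - MU) / SIGMA^2.
def score(x):
    return -(x - MU) / SIGMA**2

print("Z:", Z, "closed form:", SIGMA * np.sqrt(2 * np.pi))
print("p_latent(0.3):", p_latent(0.3), "closed form:", normal_pdf(0.3, 0.0, np.sqrt(2.0)))
print("sample mean/std:", samples.mean(), samples.std())
print("score(0.0):", score(0.0))
```

In this toy case every quantity has a closed form, which makes it easy to check each paradigm against the others; with real data, each row of the table trades away one of these conveniences.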

1. Continuous Generative Models

  • Discrete Generative Models

    • Pros:

      • Efficient Inference: They can be very fast, often requiring only one pass of the transformer to generate a sequence.
    • Cons:

      • Quality Issues: Discrete tokens suffer from a "quality issue": tokenization compresses the data heavily, so fine details are lost. Reconstructed images can look significantly different from the original when viewed up close.
      • Fundamental Compression Flaw: To be manageable, a sequence of discrete tokens must compress information far more than a continuous representation, which is a fundamental limitation.
  • Continuous Generative Models

    • Pros:

      • High-Quality Samples: They generally offer much better reconstruction than discrete models.
    • Cons:

      • Speed Issues: Continuous models, particularly diffusion, have a "speed issue" because they require many iterative steps to generate a sample. This multi-pass process makes inference slow and computationally demanding.
Figure 1: The trilemma of continuous generative models
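
The speed issue can be seen in a toy sampler. The sketch below assumes a 1-D Gaussian data distribution and uses its closed-form velocity field as a stand-in for a trained network; `velocity`, `sample`, `MU`, and `S` are hypothetical names, and Euler integration of a flow ODE is one assumed choice of iterative sampler, not the specific method of any model discussed here. Each Euler step is one model evaluation, so cost grows linearly with the step count.

```python
import numpy as np

MU, S = 2.0, 0.5  # toy data distribution N(MU, S^2); assumed values

def velocity(x, t):
    """Closed-form velocity field that transports N(0, 1) at t=0 to N(MU, S^2)
    at t=1 along the linear path x_t = (1 - t) * z + t * x_1.
    In a real model this would be a neural network call."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    return MU + (t * S**2 - (1.0 - t)) / var_t * (x - t * MU)

def sample(z, num_steps):
    """Euler integration from t=0 to t=1.
    Returns the samples and the number of model evaluations used."""
    x = z.copy()
    dt = 1.0 / num_steps
    num_evals = 0
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
        num_evals += 1
    return x, num_evals

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000)

results = {}
for n in (1, 8, 64):
    x, num_evals = sample(z, n)
    results[n] = (num_evals, x.mean(), x.std())
    print(f"steps={n:3d}  evals={num_evals:3d}  mean={x.mean():.3f}  std={x.std():.3f}")
```

In this toy setup a single step collapses every sample onto the mean `MU`, while more steps bring the sample spread closer to `S`, at proportionally higher cost.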

Does inference-time scaling benefit generative pre-training algorithms? Maybe.

References

  1. New Pre-training Paradigms from a Inference-First Perspective
  2. Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms