Language Diffusion Model

1. Large Language Diffusion with mAsking (LLaDA)

The core process involves two stages:

  • Forward Masking Process: It takes a clean text sequence and progressively corrupts it by independently masking tokens with a probability 𝑡. This masking ratio 𝑡 is randomly sampled from 0 to 1 for each training sample, meaning the model sees everything from slightly masked to fully masked text.
  • Reverse Prediction Process: A Transformer model1 is trained to predict all the masked tokens simultaneously given the corrupted text. This process is optimized as a principled generative modeling objective, which allows LLaDA to learn the underlying data distribution.
Figure 1: (a) Pre-training. LLaDA is trained on text with random masks applied independently to all tokens at the same ratio 𝑡𝑈[0,1]. (b) SFT. Only response tokens are possibly masked. (c) Sampling. LLaDA simulates a diffusion process from 𝑡=1 (fully masked) to 𝑡=0 (unmasked), predicting all masks simultaneously at each step with flexible remask strategies.

1.1. Remasking?

At each step of the generation process, the model predicts all the masked tokens simultaneously. However, instead of keeping all of these newly predicted words, the model intentionally "remasks" some of them, turning them back into mask tokens for the next step. The number of tokens that are remasked is gradually reduced at each step until no masks are left.

Remasking Strategies in LLaDA:

  • Random Remasking: In principle, the tokens to be remasked can be chosen randomly.
  • Low-Confidence Remasking: A more advanced strategy is to remask the tokens that the model predicted with the lowest confidence. This allows the model to "re-think" the parts of the sentence it is least sure about in the next step.
  • Semi-Autoregressive Remasking: For the instruction-following version of LLaDA, we can divide the sequence into several blocks and generate them from left to right. Within each block, a remasking strategy (like low-confidence remasking) is applied. This approach was found to be particularly effective for the instruction-tuned model.

2. LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Figure 2: Image representations are generated by an encoder and an MLP projector (not explicitly shown). (a) Autoregressive Training: Given image features and the input prompt, autoregressive models are trained to predict the response through next-token prediction. (b) LLaDA-V's Training: Image features and the input prompt remain unmasked, while only the response is randomly masked. (c) LLaDA-V's Inference: As time step 𝑡 decreases from 1 to 0, generation begins with a fully masked response and iteratively predicts tokens.