Sequence Modeling Alignment between Tokenizer and Autoregressive Model

1. AliTok

Natural language is inherently compact, so a non-parametric tokenizer can map words to indices one-to-one. Images, by contrast, are high-dimensional and highly redundant, and require a learnable tokenizer to remove that redundancy. Because spatial information is continuous, redundancy exists not only within individual patches but also between adjacent patches, so multiple tokens must cooperate during compression to remove it and yield a compact representation. This creates a high degree of mutual dependency among the encoded tokens: each token relies on complementary information from related tokens to fully convey its meaning, while providing necessary context for them in turn. Image compression therefore produces complex bidirectional dependencies among the encoded tokens, which a subsequent autoregressive model, predicting each token only from the tokens before it, struggles to model effectively.
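
For reference, the standard autoregressive factorization makes this mismatch explicit. The symbols are assumptions introduced here for illustration and are not taken from the source: z_1, ..., z_N denote the encoded tokens in raster order.

```latex
p(z_1, \dots, z_N) = \prod_{i=1}^{N} p(z_i \mid z_1, \dots, z_{i-1})
```

Each factor conditions only on earlier tokens, so a token whose meaning depends on later tokens must be predicted before that context is available.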

Figure 2: Two-stage training. Stage 1: training an image tokenizer with a causal decoder. Stage 2: freezing the encoder and codebook of the tokenizer, training the autoregressive model, and retraining a bidirectional tokenizer decoder.

  • Stage 1: Causal Decoder Training

    • Uses a bidirectional transformer encoder for efficient image compression
    • Introduces 𝐾=17 prefix tokens to provide initial context for the causal decoder
    • Employs a causal transformer decoder that can only attend to previous tokens
    • Forces the encoder to produce tokens whose dependencies run in one direction, so that each token can be decoded from the tokens preceding it
  • Stage 2: Bidirectional Decoder Retraining

    • Freezes the encoder and codebook from Stage 1
    • Retrains a bidirectional decoder for improved reconstruction continuity
    • Adds 32 buffer tokens to enhance computational capacity
    • Maintains the generation-friendly token properties while improving fidelity
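
The attention patterns of the two decoders can be sketched with plain Boolean masks. This is a minimal illustration, not the authors' implementation: the function names, the sequence layouts ([prefix | image tokens] in Stage 1, [image tokens | buffer tokens] in Stage 2), and the toy count of 8 image tokens are assumptions made for the example. Only K = 17 and the 32 buffer tokens come from the text above.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len):
    """Every position may attend to every other position."""
    return [[True] * seq_len for _ in range(seq_len)]


def stage1_decoder_mask(num_prefix, num_image_tokens):
    # Assumed layout: [prefix tokens | image tokens], all under causal attention.
    return causal_mask(num_prefix + num_image_tokens)


def stage2_decoder_mask(num_image_tokens, num_buffer):
    # Assumed layout: [image tokens | buffer tokens], fully bidirectional.
    return bidirectional_mask(num_image_tokens + num_buffer)


def visible_counts(mask):
    """Number of positions each row (query position) is allowed to attend to."""
    return [sum(row) for row in mask]


K = 17                 # prefix tokens (from the text)
NUM_BUFFER = 32        # buffer tokens (from the text)
NUM_IMAGE_TOKENS = 8   # hypothetical toy value, chosen only to keep the example small

stage1_mask = stage1_decoder_mask(K, NUM_IMAGE_TOKENS)
stage2_mask = stage2_decoder_mask(NUM_IMAGE_TOKENS, NUM_BUFFER)

stage1_visible = visible_counts(stage1_mask)
stage2_visible = visible_counts(stage2_mask)

# The first image token sits at index K, right after the prefix.
first_image_visible = stage1_visible[K]
last_image_visible = stage1_visible[-1]

print(first_image_visible)   # 18: the 17 prefix tokens plus the token itself
print(last_image_visible)    # 25: the whole sequence of 17 + 8 positions
print(set(stage2_visible))   # {40}: every position sees all 8 + 32 positions
```

Under these assumptions, the first image token in Stage 1 attends to the 17 prefix tokens in addition to itself, which is one way to read "initial context" for the causal decoder. In Stage 2 every position sees the full sequence, including the buffer tokens.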

References

  1. AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model