Single-pass Adaptive Image Tokenization for Minimum Program Search

1. KARL (Kolmogorov-Approximating Representation Learning)

Traditional computer vision models rely on fixed-length representations for all images, regardless of their inherent complexity. This approach contradicts fundamental principles from Algorithmic Information Theory (AIT), which suggests that intelligent systems should compress data into their shortest possible programs. Simple images with repetitive patterns should require fewer tokens than complex scenes with intricate details, yet most current methods allocate identical computational resources to both.

FigureΒ 1: While existing methods require multiple passes or are constrained by subset requirements, KARL achieves adaptive tokenization in a single pass while maintaining alignment with Kolmogorov Complexity principles.

The system uses a Perceiver-inspired architecture with three main components:

  • Latent Distillation Encoder: Processes 2D image tokens alongside learnable 1D latent tokens, producing both token embeddings and halting probabilities
  • Adaptive Masking: Uses learned halting probabilities to determine which tokens are essential for reconstruction
  • Cross-Attention Decoder: Reconstructs images using only active (non-halted) tokens

The training process alternates between two phases within each iteration:

Phase 1: Estimate Image Complexity (EIC)

  • Input: Image π‘₯ and random token budget 𝑇
  • Target: Near-lossless reconstruction (πœ€=0)
  • Output: Empirical reconstruction error πœ€0

The loss function for this phase is:

β„’EICΒ =β„’Β reconΒ (π‘₯̂𝑇,π‘₯)+𝛽ℒ quantΒ (𝑧𝑇)

Phase 2: Learn to Tokenize Complexity (LTC)

  • Input: Same image π‘₯, expanded token budget 𝑇+Δ𝑇, and target quality πœ€0
  • Target: Match the reconstruction quality πœ€0 using only 𝑇 tokens
  • Output: Token embeddings and halting probabilities

The total loss includes reconstruction, quantization, and halting components:

β„’LTCΒ =β„’Β reconΒ (π‘₯̂𝑇+Δ𝑇,π‘₯)+𝛽ℒ quantΒ (𝑧𝑇+Δ𝑇)+πœ†β„’Β haltΒ (πœ”)

The halting loss encourages the model to preserve the first 𝑇 tokens while discarding the additional Δ𝑇 tokens:

β„’haltΒ (πœ”)=BCE(πœ”0:𝑇,0)+BCE(πœ”π‘‡:𝑇+Δ𝑇,1)
FigureΒ 3: KARL Training Graph: KARL follows an upside-down RL paradigmβ€”(a) Estimate Image Complexity stage samples task-defining inputs {image, token budget 𝑇, reconstruction error πœ€0} by attempting lossless compression (πœ€=0). (b) Learn to Tokenize stage trains the modelβ€”conditioned on πœ€0β€”to match the same quality using 𝑇+Δ𝑇 tokens while halting the extra (rightmost) Δ𝑇. The shared encoder πœƒEnc and decoder πœƒDec are updated in each iteration using reconstruction losses in both stages and halting loss in the second.