Single-pass Adaptive Image Tokenization for Minimum Program Search

July 15, 2025

by Leonardo

1. KARL (Kolmogorov-Approximating Representation Learning)

Traditional computer vision models rely on fixed-length representations for all images, regardless of their inherent complexity. This approach runs counter to a core idea from Algorithmic Information Theory (AIT): an intelligent system should compress data into the shortest program that reproduces it. Simple images with repetitive patterns should require fewer tokens than complex scenes with intricate details, yet most current methods allocate identical computational resources to both.

Figure 1: While existing methods require multiple passes or are constrained by subset requirements, KARL achieves adaptive tokenization in a single pass while maintaining alignment with Kolmogorov Complexity principles.

The system uses a Perceiver-inspired architecture with three main components:

Latent Distillation Encoder: Processes 2D image tokens alongside learnable 1D latent tokens, producing both token embeddings and halting probabilities
Adaptive Masking: Uses learned halting probabilities to determine which tokens are essential for reconstruction
Cross-Attention Decoder: Reconstructs images using only active (non-halted) tokens
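The adaptive masking step can be sketched in a few lines. This is a minimal illustration, not the KARL implementation: the function name `select_active_tokens` and the fixed threshold of 0.5 are assumptions made here, and a token is treated as halted when its halting probability is at or above that threshold.

```python
import numpy as np

def select_active_tokens(tokens, halting_probs, threshold=0.5):
    """Keep tokens whose halting probability is below `threshold`.

    `threshold` is an assumed hyperparameter, not a value from the article.
    Returns the kept token embeddings and the boolean active mask.
    """
    tokens = np.asarray(tokens)
    halting_probs = np.asarray(halting_probs)
    active = halting_probs < threshold  # True = token is passed to the decoder
    return tokens[active], active

# Toy example: 6 latent tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))
halting_probs = np.array([0.05, 0.10, 0.20, 0.70, 0.90, 0.95])

kept, mask = select_active_tokens(tokens, halting_probs)
print(kept.shape, int(mask.sum()))  # (3, 4) 3
```

Only the three tokens with low halting probability would reach the decoder in this toy case; the other three are masked out.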

The training process alternates between two phases within each iteration:

Phase 1: Estimate Image Complexity (EIC)

Input: Image $𝑥$ and random token budget $𝑇$
Target: Near-lossless reconstruction ($𝜀 = 0$)
Output: Empirical reconstruction error $𝜀_{0}$

The loss function for this phase is:

$$\mathcal{L}_{\text{EIC}} = \mathcal{L}_{\text{recon}}(\hat{x}_{T}, x) + \beta \, \mathcal{L}_{\text{quant}}(z_{T})$$
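To make the idea of an empirical error $𝜀_{0}$ at a given budget $𝑇$ concrete, here is a toy sketch. It does not use the KARL encoder or decoder: a rank-$𝑇$ truncated SVD stands in for "reconstruct the image from $𝑇$ tokens", and mean squared error stands in for the reconstruction loss. Both substitutions, and the name `reconstruction_error`, are assumptions made purely for illustration.

```python
import numpy as np

def reconstruction_error(image, budget):
    """MSE of a rank-`budget` SVD reconstruction.

    Stand-in for encoding and decoding with a budget of T tokens;
    the real model uses a learned tokenizer, not an SVD.
    """
    u, s, vt = np.linalg.svd(image, full_matrices=False)
    approx = (u[:, :budget] * s[:budget]) @ vt[:budget]
    return float(np.mean((image - approx) ** 2))

rng = np.random.default_rng(0)
simple_image = np.outer(np.linspace(0.0, 1.0, 32), np.ones(32))  # smooth rank-1 gradient
complex_image = rng.normal(size=(32, 32))                        # unstructured noise

T = 4  # token budget (sampled at random during training, fixed here)
eps0_simple = reconstruction_error(simple_image, T)
eps0_complex = reconstruction_error(complex_image, T)
print(eps0_simple < eps0_complex)  # True
```

With the same budget, the simple gradient is reconstructed almost exactly while the noise image is not, which is the kind of complexity signal Phase 1 is meant to expose.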

Phase 2: Learn to Tokenize Complexity (LTC)

Input: Same image $𝑥$, expanded token budget $𝑇 + Δ 𝑇$, and target quality $𝜀_{0}$
Target: Match the reconstruction quality $𝜀_{0}$ using only $𝑇$ tokens
Output: Token embeddings and halting probabilities

The total loss includes reconstruction, quantization, and halting components:

$$\mathcal{L}_{\text{LTC}} = \mathcal{L}_{\text{recon}}(\hat{x}_{T + \Delta T}, x) + \beta \, \mathcal{L}_{\text{quant}}(z_{T + \Delta T}) + \lambda \, \mathcal{L}_{\text{halt}}(\omega)$$

The halting loss encourages the model to preserve the first $𝑇$ tokens while discarding the additional $Δ 𝑇$ tokens:

$$\mathcal{L}_{\text{halt}}(\omega) = \mathrm{BCE}(\omega_{0:T}, 0) + \mathrm{BCE}(\omega_{T:T+\Delta T}, 1)$$
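The halting loss can be written out directly in NumPy. This is a sketch under assumptions: each BCE term is averaged over its own token segment (the article does not say whether a mean or a sum is used), probabilities are clipped for numerical stability, and the helper names `bce` and `halting_loss` are hypothetical.

```python
import numpy as np

def bce(probs, target, eps=1e-7):
    """Mean binary cross-entropy of `probs` against a constant `target` (0 or 1)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    losses = -(target * np.log(probs) + (1.0 - target) * np.log(1.0 - probs))
    return float(np.mean(losses))

def halting_loss(omega, T):
    """BCE(omega[0:T], 0) + BCE(omega[T:], 1), with 0 < T < len(omega).

    The first T tokens should stay active (halting probability near 0);
    the remaining Delta-T tokens should halt (halting probability near 1).
    """
    omega = np.asarray(omega, dtype=float)
    return bce(omega[:T], 0.0) + bce(omega[T:], 1.0)

T = 3
good = np.array([0.01, 0.02, 0.01, 0.99, 0.98])  # keeps first 3, halts last 2
bad = good[::-1]                                  # halts the wrong tokens

print(halting_loss(good, T) < halting_loss(bad, T))  # True
```

A prediction that keeps the first $𝑇$ tokens and halts the extra $Δ 𝑇$ tokens gets a loss near zero, while the reversed prediction is penalized heavily.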
