Flowing Seamlessly Across Text and Image Tokens

July 4, 2025

by Leonardo

1. FlowTok: Flowing Seamlessly Across Text and Image Tokens

FlowTok approaches cross-modal generation by applying flow matching directly between the text and image modalities. Unlike conventional methods that treat text as a conditioning signal for image generation, FlowTok projects both modalities into a unified, compact 1D latent space.

1.1. Direct Flow Between Modalities

1.1.1. Unified Latent Space Design

FlowTok encodes both text and images into compact 1D token sequences of shape $77 \times 16$ (77 tokens with 16 channels each):

Text Processing:

CLIP text encoder extracts initial embeddings $T_{\text{init}} \in \mathbb{R}^{N \times C}$
Text projector maps to latent space: $Z^{T} \in \mathbb{R}^{N \times D}$
Gaussian distribution modeling with KL regularization

Image Processing:

Enhanced TA-TiTok with RoPE and SwiGLU FFN
Direct encoding to $Z^{I} \in \mathbb{R}^{K \times D}$ where $K = N = 77$
Maintains semantic information in compact representation
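The shapes above can be checked with a small NumPy sketch. Only $N = K = 77$ and $D = 16$ come from the text. The CLIP width `C = 768`, the single linear layer used as a text projector, and all function and variable names are illustrative assumptions, and random arrays stand in for the real encoder outputs.

```python
import numpy as np

N, C, D = 77, 768, 16  # N and D from the text; C = 768 is an assumed CLIP width

rng = np.random.default_rng(0)

def project_text(t_init, w_mu, w_logvar, rng):
    """Hypothetical linear text projector with a Gaussian latent.

    Maps (N, C) embeddings to a mean and log-variance of shape (N, D),
    then draws a sample with the reparameterization trick.
    """
    mu = t_init @ w_mu
    logvar = t_init @ w_logvar
    z_text = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    return z_text, mu, logvar

t_init = rng.standard_normal((N, C))             # stand-in for CLIP text embeddings
w_mu = rng.standard_normal((C, D)) / np.sqrt(C)  # assumed projector weights
w_logvar = rng.standard_normal((C, D)) / np.sqrt(C)

z_text, mu, logvar = project_text(t_init, w_mu, w_logvar, rng)
z_image = rng.standard_normal((N, D))            # stand-in for image tokenizer output

print(z_text.shape, z_image.shape)
```

Because both modalities end up with the same shape, a flow between them needs no padding or cross-attention bridge.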

1.1.2. Flow Matching Framework

The flow matching objective learns a direct transformation along a linear interpolation path:

$$X_{t} = (1 - t) \cdot X + t \cdot N$$

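where $X$ and $N$ are samples from the two endpoint distributions (this $N$ is unrelated to the token count $N = 77$ above), $t \in [0, 1]$, and the velocity field is: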

$$V_{t} = \frac{d X_{t}}{d t} = N - X$$

Unlike standard flow matching, which uses noise as the source distribution, FlowTok flows directly between the two modalities: the text tokens $Z^{T}$ and the image tokens $Z^{I}$ serve as the source and target distributions, and either one can play either role.
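The interpolation and velocity formulas can be sketched in a few lines of NumPy. This is a minimal illustration, not the FlowTok implementation: placing text tokens at the $t = 0$ endpoint and image tokens at $t = 1$ is an assumption, the function names are hypothetical, and random arrays stand in for real tokens. In training, a network would be regressed onto `target_velocity` at sampled points `interpolate(x, n, t)`; here the exact velocity is used in place of a network.

```python
import numpy as np

def interpolate(x, n, t):
    """Point on the straight path X_t = (1 - t) * X + t * N."""
    return (1.0 - t) * x + t * n

def target_velocity(x, n):
    """Constant velocity dX_t/dt = N - X along the straight path."""
    return n - x

def euler_integrate(x, velocity_fn, steps=10):
    """Integrate dX/dt = v(X, t) from t = 0 to t = 1 with Euler steps."""
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
z_text = rng.standard_normal((77, 16))   # stand-in text tokens (assumed t = 0 endpoint)
z_image = rng.standard_normal((77, 16))  # stand-in image tokens (assumed t = 1 endpoint)

x_half = interpolate(z_text, z_image, 0.5)  # midpoint of the path
v = target_velocity(z_text, z_image)

# With the exact velocity, Euler integration lands on the other endpoint.
z_out = euler_integrate(z_text, lambda x, t: v)
print(np.allclose(z_out, z_image))
```

Swapping the two arguments gives the reverse direction, which is what makes the same formulation usable for image-to-text.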

1.1.3. Semantic Preservation

To prevent information loss during dimensionality reduction, FlowTok introduces a text alignment loss:

$$\mathcal{L}_{\text{align}} = \frac{\mathrm{CE}(\text{logits}_{TZ}, \text{labels}) + \mathrm{CE}(\text{logits}_{ZT}, \text{labels})}{2}$$

where:

$$
\begin{aligned}
\text{logits}_{TZ} &= \exp(\tau) \times \left(T_{P} \times Z_{T}^{\top}\right) \\
\text{logits}_{ZT} &= \exp(\tau) \times \left(Z_{T} \times T_{P}^{\top}\right)
\end{aligned}
$$
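The alignment loss has the same form as a CLIP-style symmetric contrastive loss, which can be sketched as follows. Several details here are assumptions rather than facts from the text: each sample is reduced to a single feature vector so that $T_{P}$ and $Z_{T}$ are `(B, F)` batches, the features are L2-normalized, the labels are the diagonal indices (matching pairs), and $\tau$ is a scalar log-scale. All names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy over rows, computed with a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def alignment_loss(t_p, z_t, tau):
    """Symmetric contrastive loss between two (B, F) feature batches."""
    # Assumed preprocessing: unit-normalize so the products are cosine similarities.
    t_p = t_p / np.linalg.norm(t_p, axis=1, keepdims=True)
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    logits_tz = np.exp(tau) * (t_p @ z_t.T)
    logits_zt = np.exp(tau) * (z_t @ t_p.T)
    labels = np.arange(t_p.shape[0])  # assumed: matching pairs lie on the diagonal
    return 0.5 * (cross_entropy(logits_tz, labels) + cross_entropy(logits_zt, labels))

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 32))  # stand-in features, batch of 8

aligned = alignment_loss(feats, feats, tau=np.log(20.0))         # pairs match
shuffled = alignment_loss(feats, feats[::-1], tau=np.log(20.0))  # pairs mismatched
print(aligned < shuffled)
```

The loss is low when each projected text latent is most similar to its own original embedding, which is the sense in which it preserves semantics through the projection.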

1.1.4. Training Objective

The complete loss combines the flow matching loss, the KL regularization term, and the text alignment loss:

$$\mathcal{L} = \mathcal{L}_{\text{fm}} + \gamma_{1} \cdot \mathcal{L}_{\text{kld}} + \gamma_{2} \cdot \mathcal{L}_{\text{align}}$$
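As a rough sketch of how the three terms might be assembled: the code below assumes a mean-squared-error flow matching loss and a KL term against a standard normal prior, neither of which is spelled out in the text. The weight values are placeholders chosen for illustration, not the values used by FlowTok.

```python
import numpy as np

def flow_matching_loss(v_pred, v_target):
    """Assumed form: mean squared error between predicted and target velocity."""
    return np.mean((v_pred - v_target) ** 2)

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)), averaged over all elements."""
    return 0.5 * np.mean(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def total_loss(l_fm, l_kld, l_align, gamma_1, gamma_2):
    """L = L_fm + gamma_1 * L_kld + gamma_2 * L_align."""
    return l_fm + gamma_1 * l_kld + gamma_2 * l_align

# A latent that already matches the prior contributes no KL penalty.
kl_zero = kl_to_standard_normal(np.zeros((77, 16)), np.zeros((77, 16)))

# Placeholder loss values and weights, for illustration only.
loss = total_loss(l_fm=1.0, l_kld=2.0, l_align=3.0, gamma_1=0.1, gamma_2=0.5)
print(kl_zero, loss)
```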