Flowing Seamlessly Across Text and Image Tokens

1. FlowTok: Flowing Seamlessly Across Text and Image Tokens

FlowTok approaches cross-modal generation by applying flow matching directly between the text and image modalities. Unlike conventional methods that treat text as a conditioning signal for image generation, FlowTok projects both modalities into a unified, compact 1D latent space.

1.1. Direct Flow Between Modalities

1.1.1. Unified Latent Space Design

FlowTok encodes both text and images into compact 1D tokens of shape 77×16:

Text Processing:

  • CLIP text encoder extracts initial embeddings $T_{\text{init}} \in \mathbb{R}^{N \times C}$
  • Text projector maps to latent space: $Z_T \in \mathbb{R}^{N \times D}$
  • Gaussian distribution modeling with KL regularization
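
The text path above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the single linear layer standing in for the text projector, the function name project_text, and the CLIP width C = 768 are all assumptions made for the example. Only the shapes N = 77 and D = 16 come from the text.

```python
import numpy as np

def project_text(t_init, w, b, rng):
    """Map (N, C) text embeddings to (N, D) latents via a diagonal Gaussian.

    Hypothetical projector: one linear layer that predicts a mean and a
    log-variance per latent channel (an assumption for illustration).
    """
    stats = t_init @ w + b                      # (N, 2 * D)
    mean, logvar = np.split(stats, 2, axis=-1)  # each (N, D)
    eps = rng.standard_normal(mean.shape)
    z_t = mean + np.exp(0.5 * logvar) * eps     # reparameterized sample
    # KL(N(mean, var) || N(0, I)): summed over channels, averaged over tokens.
    kl = 0.5 * np.sum(mean**2 + np.exp(logvar) - 1.0 - logvar, axis=-1).mean()
    return z_t, kl

rng = np.random.default_rng(0)
N, C, D = 77, 768, 16                  # C = 768 is an assumed CLIP width
t_init = rng.standard_normal((N, C))   # placeholder for CLIP embeddings
w = rng.standard_normal((C, 2 * D)) * 0.01
b = np.zeros(2 * D)
z_t, kl = project_text(t_init, w, b, rng)
print(z_t.shape, float(kl))
```

The KL term is what the training objective later weights as the regularizer. It is zero exactly when the predicted mean is 0 and the predicted variance is 1.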

Image Processing:

  • Enhanced TA-TiTok with RoPE and SwiGLU FFN
  • Direct encoding to $Z_I \in \mathbb{R}^{K \times D}$, where $K = N = 77$
  • Maintains semantic information in compact representation

1.1.2. Flow Matching Framework

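The flow matching objective learns a direct transformation along a linear path between two endpoints $X$ and $N$, for $t \in [0, 1]$:

$$X_t = (1 - t) \cdot X + t \cdot N$$

where the velocity field is:

$$V_t = \frac{dX_t}{dt} = N - X$$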

Unlike standard flow matching, which uses noise as the source distribution, FlowTok flows directly between text tokens $Z_T$ and image tokens $Z_I$: either one can serve as the source and the other as the target, depending on the generation direction.
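
As a rough sketch of the interpolation and its velocity target, the snippet below builds one training pair. The function name flow_pair and the random arrays are placeholders for illustration. Which modality sits at which endpoint is left open here, since it depends on the generation direction.

```python
import numpy as np

def flow_pair(x, n, t):
    """Return the interpolated state X_t and the target velocity V_t."""
    x_t = (1.0 - t) * x + t * n   # X_t = (1 - t) * X + t * N
    v_t = n - x                   # V_t = dX_t/dt = N - X, constant in t
    return x_t, v_t

rng = np.random.default_rng(0)
# Random placeholders standing in for encoded tokens of shape 77 x 16.
z_a = rng.standard_normal((77, 16))   # endpoint reached at t = 0
z_b = rng.standard_normal((77, 16))   # endpoint reached at t = 1
t = rng.uniform()                     # one timestep drawn from [0, 1)
x_t, v_t = flow_pair(z_a, z_b, t)
print(x_t.shape, v_t.shape)
```

A network trained on such pairs would take x_t and t as input and regress v_t. Because the path is linear, the target velocity is the same at every timestep.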

1.1.3. Semantic Preservation

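To prevent information loss during dimensionality reduction, FlowTok introduces a text alignment loss:

$$\mathcal{L}_{\text{align}} = \frac{\text{CE}(\text{logits}_{TZ}, \text{labels}) + \text{CE}(\text{logits}_{ZT}, \text{labels})}{2}$$

where:

$$\text{logits}_{TZ} = \exp(\tau) \times (T_P \times Z_T^{\top}), \qquad \text{logits}_{ZT} = \exp(\tau) \times (Z_T \times T_P^{\top})$$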
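
The symmetric cross-entropy above has the same form as a CLIP-style contrastive loss, and can be sketched as follows. Several details are assumptions made for this example rather than facts from the text: each sample's features are flattened to one L2-normalized row vector, the labels are the matching indices 0..B-1 within a batch, and tau is initialized to log(1/0.07). The helper names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy over rows, computed with a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def align_loss(t_p, z_t, tau):
    """Average of the two directional cross-entropy terms."""
    logits_tz = np.exp(tau) * (t_p @ z_t.T)   # (B, B)
    logits_zt = np.exp(tau) * (z_t @ t_p.T)   # (B, B)
    labels = np.arange(t_p.shape[0])          # assumed: row i matches column i
    return 0.5 * (cross_entropy(logits_tz, labels)
                  + cross_entropy(logits_zt, labels))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
B, F = 4, 77 * 16                 # assumed batch size and flattened width
z_t = l2_normalize(rng.standard_normal((B, F)))
t_p = z_t.copy()                  # perfectly aligned features, for illustration
tau = np.log(1.0 / 0.07)          # assumed temperature initialization
print(float(align_loss(t_p, z_t, tau)))
```

With perfectly aligned features the loss is close to zero. Pairing each row with the wrong partner drives it up sharply, which is the signal that keeps the projected tokens tied to the original text features.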

1.1.4. Training Objective

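Complete loss function:

$$\mathcal{L} = \mathcal{L}_{\text{fm}} + \gamma_1 \cdot \mathcal{L}_{\text{kld}} + \gamma_2 \cdot \mathcal{L}_{\text{align}}$$

where:

  • $\mathcal{L}_{\text{fm}}$: Flow matching loss
  • $\mathcal{L}_{\text{kld}}$: KL divergence regularization
  • $\mathcal{L}_{\text{align}}$: Text alignment preservation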
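
For concreteness, the weighted sum can be written as a small helper. The mean-squared-error form of the flow matching term is the common choice and is assumed here, not taken from the text. The gamma values and the scalar loss inputs are arbitrary placeholders, since the text does not state the weights.

```python
import numpy as np

def flow_matching_loss(v_pred, v_target):
    """Assumed form: mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

def total_loss(l_fm, l_kld, l_align, gamma1, gamma2):
    """L = L_fm + gamma1 * L_kld + gamma2 * L_align."""
    return l_fm + gamma1 * l_kld + gamma2 * l_align

rng = np.random.default_rng(0)
v_target = rng.standard_normal((77, 16))
v_pred = v_target + 0.1                  # stand-in for a network prediction
l_fm = flow_matching_loss(v_pred, v_target)
# gamma1, gamma2 and the other loss values are placeholders, not paper values.
loss = total_loss(l_fm, l_kld=0.5, l_align=2.0, gamma1=1e-4, gamma2=1.0)
print(loss)
```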

References

  1. FlowTok: Flowing Seamlessly Across Text and Image Tokens