ColBERT and FILIP

September 2, 2025

by Leonardo

1. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

Figure 1: Schematic diagrams illustrating query–document matching paradigms in neural IR. The figure contrasts existing approaches (sub-figures (a), (b), and (c)) with the proposed late interaction paradigm (sub-figure (d)).

The field of Information Retrieval (IR) has seen remarkable progress with the advent of deep language models. These models, often based on the Transformer architecture, achieve state-of-the-art effectiveness in ranking tasks. However, their computational cost poses a significant challenge for real-world deployment, where query latency is a critical metric.

A typical neural ranker computes a relevance score $S(q, d)$ for a query $q$ and a document $d$. The core issue is that computing this score often requires a joint, deep encoding of the query-document pair, which makes it impossible to pre-compute document representations offline. Let $E_{q}$ and $E_{d}$ denote the sets of contextualized token embeddings for the query and the document, respectively, where each embedding is a vector in $\mathbb{R}^{m}$, with $m$ the embedding dimension.

To address this efficiency bottleneck, ColBERT proposes a late interaction architecture. Instead of encoding the pair jointly, it encodes the query and the document independently and applies a lightweight yet expressive interaction step afterward, so document embeddings can be computed offline. The final score is a sum of maximum similarities:

\begin{aligned}
S(q, d) &= \sum_{i = 1}^{|E_{q}|} \max_{j = 1}^{|E_{d}|} \operatorname{sim}(E_{q_{i}}, E_{d_{j}}) \\
&= \sum_{i = 1}^{|E_{q}|} \max_{j = 1}^{|E_{d}|} \frac{E_{q_{i}}^{\top} E_{d_{j}}}{\| E_{q_{i}} \| \, \| E_{d_{j}} \|}
\end{aligned}
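The sum-of-maxima score can be illustrated with a minimal sketch in plain Python. The function names (`cosine`, `maxsim_score`) and the toy 2-dimensional embeddings are assumptions made for illustration; a real system would use batched tensor operations over BERT outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best cosine match among document tokens."""
    return sum(max(cosine(q, d) for d in doc_embs) for q in query_embs)

# Toy embeddings, made up for illustration (not real model outputs).
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = [[2.0, 0.0], [1.0, 1.0], [0.0, -1.0]]

score = maxsim_score(query_embs, doc_embs)
print(round(score, 4))
```

Here the first query token matches the first document token exactly (similarity 1), while the second query token finds its best match in the second document token (similarity $1/\sqrt{2}$). Because the document embeddings do not depend on the query, they could be computed once and stored.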

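The model is trained end-to-end on triples consisting of a query, a relevant document, and a non-relevant document, by minimizing a pairwise ranking loss that pushes the relevant document's score above the non-relevant one's.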
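The text does not spell out the form of the pairwise loss. One common choice, and the one described in the ColBERT paper, is a softmax cross-entropy over the scores of a positive and a negative document. The sketch below assumes that form; the function name `pairwise_softmax_loss` and the example scores are hypothetical.

```python
import math

def pairwise_softmax_loss(pos_score, neg_score):
    """Negative log-probability of the positive document under a 2-way softmax."""
    # Subtract the larger score before exponentiating, for numerical stability.
    m = max(pos_score, neg_score)
    log_z = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return log_z - pos_score

# Hypothetical scores S(q, d+) and S(q, d-) for two training triples.
loss_tied = pairwise_softmax_loss(3.0, 3.0)  # model cannot tell the documents apart
loss_good = pairwise_softmax_loss(5.0, 1.0)  # positive document clearly ahead
print(loss_tied, loss_good)
```

The loss equals $\ln 2$ when the two scores tie and shrinks toward zero as the positive document pulls ahead.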

2. FILIP: Fine-grained Interactive Language-Image Pre-Training

Vision-Language Pre-training (VLP) models like CLIP have demonstrated powerful capabilities by aligning global image and text features. However, this global alignment approach lacks the ability to capture finer-grained relationships, such as the correspondence between specific objects in an image and words in a text description.

To address this, the FILIP (Fine-grained Interactive Language-Image Pre-training) model was introduced. It enhances cross-modal alignment by adopting a late interaction mechanism inspired by ColBERT. This mechanism operates directly within the contrastive learning objective.

Instead of comparing single global feature vectors, FILIP computes a token-wise similarity matrix between all image patch embeddings $E_{I}$ and all text token embeddings $E_{T}$. The final image-to-text similarity score $s^{I}$ is calculated by taking the average of token-wise maximum similarities:

s_{i, j}^{I}(x_{i}^{I}, x_{j}^{T}) = \frac{1}{n_{1}} \sum_{k = 1}^{n_{1}} \max_{0 \leq r < n_{2}} [f_{\theta}(x_{i}^{I})]_{k}^{\top} [g_{\varphi}(x_{j}^{T})]_{r}

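where $f_{\theta}$ and $g_{\varphi}$ are the image and text encoders, $[\cdot]_{k}$ denotes the $k$-th token embedding of an encoder output, and $n_{1}$ and $n_{2}$ are the number of image and text tokens, respectively. A symmetric operation, averaging over text tokens the maximum similarity to any image patch, gives the text-to-image similarity $s^{T}$. Note that $s^{I}$ and $s^{T}$ are generally not equal. This encourages a detailed alignment between image patches and textual words.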
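The two directional similarities can be sketched in plain Python as follows. The helper names (`dot`, `mean_max_similarity`) and the toy embeddings are assumptions for illustration, and the embeddings are assumed to be L2-normalized already, so a dot product acts as cosine similarity.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_max_similarity(src_tokens, tgt_tokens):
    """Average, over source tokens, of the largest dot product with any target token."""
    best = [max(dot(s, t) for t in tgt_tokens) for s in src_tokens]
    return sum(best) / len(src_tokens)

# Toy unit-length token embeddings, made up for illustration.
image_patches = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # n_1 = 3 patches
text_tokens = [[1.0, 0.0], [0.8, 0.6]]                # n_2 = 2 tokens

s_image_to_text = mean_max_similarity(image_patches, text_tokens)
s_text_to_image = mean_max_similarity(text_tokens, image_patches)
print(s_image_to_text, s_text_to_image)
```

The two directions give different values on this toy input because each one averages over a different set of tokens. That asymmetry is why both are computed.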

The model is trained using a standard contrastive loss $\mathcal{L}$ over batches of image-text pairs. The key advantages are:

It maintains the efficiency of dual-stream models, allowing for offline pre-computation of image and text representations.
It achieves superior performance in downstream tasks, outperforming CLIP on zero-shot ImageNet classification and various image-text retrieval benchmarks, even with less training data.
Visualizations confirm that FILIP learns meaningful, fine-grained alignments, correctly mapping textual tokens to corresponding image regions.

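The late interaction mechanism itself is inspired by the idea, taken from ColBERT, that for a document to be relevant, each important concept in the query should find a strong semantic match within the document. FILIP applies the same intuition across modalities: each image patch should find a well-matching word, and each word a well-matching patch.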
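The batch-level contrastive objective mentioned above can be sketched as a symmetric cross-entropy over the two similarity matrices, where the matched pair for each image or text sits on the diagonal. This is a minimal sketch under stated assumptions: the function names, the `temperature` parameter and its default value, and the toy similarity values are all hypothetical and not taken from the text.

```python
import math

def mean_diagonal_nll(sim, temperature):
    """Mean over rows i of -log softmax(sim[i] / temperature)[i]."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)

def contrastive_loss(sim_i2t, sim_t2i, temperature=0.07):
    """sim_i2t[i][j]: image i vs text j; sim_t2i[j][i]: text j vs image i.

    The temperature value is an assumed placeholder, not a value from the text.
    """
    return 0.5 * (mean_diagonal_nll(sim_i2t, temperature)
                  + mean_diagonal_nll(sim_t2i, temperature))

# Toy batch of two image-text pairs (made-up similarity values).
aligned = contrastive_loss([[0.9, 0.1], [0.2, 0.8]], [[0.9, 0.2], [0.1, 0.8]])
swapped = contrastive_loss([[0.1, 0.9], [0.8, 0.2]], [[0.2, 0.9], [0.8, 0.1]])
uniform = contrastive_loss([[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]])
print(aligned, swapped, uniform)
```

The loss is low when matched pairs score highest, high when mismatched pairs do, and equal to the log of the batch size when all similarities are identical.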
