ColBERT and FILIP

1. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

Figure 1: Schematic diagrams illustrating query–document matching paradigms in neural IR. The figure contrasts existing approaches (sub-figures (a), (b), and (c)) with the proposed late interaction paradigm (sub-figure (d)).

The field of Information Retrieval (IR) has seen remarkable progress with the advent of deep language models. These models, often based on the Transformer architecture, achieve state-of-the-art effectiveness in ranking tasks. However, their computational cost poses a significant challenge for real-world deployment, where query latency is a critical metric.

A typical neural ranker computes a relevance score $S(q, d)$ for a given query $q$ and a document $d$. The core issue is that computing this score often requires a joint, deep encoding of the query-document pair, making it impossible to pre-compute document representations offline. Let $E_q$ and $E_d$ represent the sets of contextualized token embeddings for the query and document, respectively, where each embedding is a vector in $\mathbb{R}^m$ and $m$ is the embedding dimension.

To address this efficiency bottleneck, ColBERT proposes a late interaction architecture. Instead of jointly encoding the pair, it encodes the query and the document independently and then applies a lightweight yet expressive interaction step. Because the document embeddings no longer depend on the query, they can be computed offline. The final score is a sum of maximum similarities (see footnote 1):

$$S(q,d) = \sum_{i=1}^{|E_q|} \max_{j=1}^{|E_d|} \operatorname{sim}(E_{q_i}, E_{d_j}) = \sum_{i=1}^{|E_q|} \max_{j=1}^{|E_d|} \frac{E_{q_i}^{\top} E_{d_j}}{\lVert E_{q_i} \rVert \, \lVert E_{d_j} \rVert}$$
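
The scoring rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not ColBERT's actual implementation: the helper names `cosine` and `maxsim_score` and the toy 2-dimensional embeddings are assumptions made for the example, and a real system would use batched tensor operations over encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim_score(query_embs, doc_embs):
    """For each query token, take its best cosine match among the
    document tokens, then sum those maxima over all query tokens."""
    return sum(max(cosine(q, d) for d in doc_embs) for q in query_embs)

# Toy embeddings (illustrative values, not produced by a real encoder).
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

score = maxsim_score(query_embs, doc_embs)
print(score)
```

Each query token here has a perfectly aligned document token, so both maxima equal 1 and the score equals the number of query tokens. Note that `doc_embs` never depends on the query, which is what makes offline pre-computation possible.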

The model is trained end-to-end by minimizing a pairwise ranking loss function.
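
As a rough sketch of what a pairwise ranking loss can look like, the snippet below applies a two-way softmax cross-entropy to the scores of one relevant and one non-relevant document for the same query. The text above only says "pairwise ranking loss", so the specific softmax form and the function name `pairwise_softmax_loss` are assumptions for illustration.

```python
import math

def pairwise_softmax_loss(pos_score, neg_score):
    """Cross-entropy over a two-way softmax of (positive, negative) scores.

    -log(exp(s+) / (exp(s+) + exp(s-))) simplifies to log(1 + exp(s- - s+)),
    computed here in a numerically stable softplus form.
    """
    margin = neg_score - pos_score
    return max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss shrinks as the relevant document outscores the non-relevant one.
well_ranked = pairwise_softmax_loss(pos_score=5.0, neg_score=1.0)
badly_ranked = pairwise_softmax_loss(pos_score=1.0, neg_score=5.0)
print(well_ranked, badly_ranked)
```

In training, `pos_score` and `neg_score` would be the late interaction scores $S(q, d^{+})$ and $S(q, d^{-})$, so gradients flow back through the max and sum into both encoders.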

2. FILIP: Fine-grained Interactive Language-Image Pre-Training

Vision-Language Pre-training (VLP) models like CLIP have demonstrated powerful capabilities by aligning global image and text features. However, this global alignment approach lacks the ability to capture finer-grained relationships, such as the correspondence between specific objects in an image and words in a text description.

To address this, the FILIP (Fine-grained Interactive Language-Image Pre-training) model was introduced. It enhances cross-modal alignment by adopting a late interaction mechanism inspired by ColBERT. This mechanism operates directly within the contrastive learning objective.

Instead of comparing single global feature vectors, FILIP computes a token-wise similarity matrix between all image patch embeddings $E_I$ and all text token embeddings $E_T$. The final image-to-text similarity score $s^I$ is calculated by taking the average of token-wise maximum similarities:

$$s_{i,j}^{I}(x_i^{I}, x_j^{T}) = \frac{1}{n_1} \sum_{k=1}^{n_1} \max_{0 \le r < n_2} [f_\theta(x_i^{I})]_k^{\top} [g_\varphi(x_j^{T})]_r$$

where $n_1$ and $n_2$ are the number of image and text tokens, respectively. A symmetric operation is performed to compute the text-to-image similarity $s^T$. This encourages a detailed alignment between image patches and textual words.
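
The two directional scores can be illustrated with a small plain-Python sketch. The function names, the toy vectors, and the assumption that token embeddings are already unit-normalized are all choices made for this example rather than details taken from FILIP's code.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def image_to_text_sim(image_tokens, text_tokens):
    """Average, over image patches, of the best match among text tokens."""
    best = [max(dot(p, t) for t in text_tokens) for p in image_tokens]
    return sum(best) / len(image_tokens)

def text_to_image_sim(image_tokens, text_tokens):
    """Average, over text tokens, of the best match among image patches."""
    best = [max(dot(t, p) for p in image_tokens) for t in text_tokens]
    return sum(best) / len(text_tokens)

# Toy unit-length embeddings (illustrative values only).
image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # n1 = 3 patches
text_tokens = [[1.0, 0.0], [0.6, 0.8]]               # n2 = 2 words

s_i = image_to_text_sim(image_tokens, text_tokens)
s_t = text_to_image_sim(image_tokens, text_tokens)
print(s_i, s_t)
```

The two values differ in this example because the second patch has no exact counterpart among the words, while every word has an exact counterpart among the patches. This asymmetry is why both directions are computed separately.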

The model is trained using a standard contrastive loss $\mathcal{L}$ over batches of image-text pairs. The key advantages are:

  • It maintains the efficiency of dual-stream models, allowing for offline pre-computation of image and text representations.
  • It achieves superior performance in downstream tasks, outperforming CLIP on zero-shot ImageNet classification and various image-text retrieval benchmarks, even with less training data.
  • Visualizations confirm that FILIP learns meaningful, fine-grained alignments, correctly mapping textual tokens to corresponding image regions.

Footnote 1 (ColBERT scoring, Section 1): This approach is inspired by the idea that, for a document to be relevant, each important concept in the query should find a strong semantic match within the document.
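
To make the contrastive objective used for FILIP training more concrete, here is a minimal sketch of a symmetric batch loss over precomputed similarity matrices. The matrix layout (row `k` holds the scores of item `k` against every candidate in the batch, with the true pair on the diagonal), the function names, and the `temperature` default are assumptions for illustration, not values taken from the paper.

```python
import math

def mean_diagonal_nll(sim_rows, temperature):
    """Mean negative log-probability of the matching pair (diagonal entry)
    under a softmax over each row of the similarity matrix."""
    total = 0.0
    for k, row in enumerate(sim_rows):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_norm - logits[k]
    return total / len(sim_rows)

def contrastive_loss(sim_i2t, sim_t2i, temperature=0.07):
    """Average of the image-to-text and text-to-image losses.

    sim_i2t[k][j]: similarity of image k to text j (hypothetical layout).
    sim_t2i[k][j]: similarity of text k to image j (hypothetical layout).
    """
    return 0.5 * (mean_diagonal_nll(sim_i2t, temperature)
                  + mean_diagonal_nll(sim_t2i, temperature))

# A batch of two pairs: matched pairs score 1.0, mismatched pairs 0.0.
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]

print(contrastive_loss(aligned, aligned))
print(contrastive_loss(misaligned, misaligned))
```

The loss is small when each image scores highest against its own caption and large when the scores are swapped. In FILIP the entries of both matrices would be the token-wise scores $s^I$ and $s^T$ rather than similarities between global feature vectors.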

References

  1. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
  2. FILIP: Fine-grained Interactive Language-Image Pre-Training