
Building a High-Performance Benchmarking Framework for Multimodal Document Retrieval

Muhammad Adil Fayyaz

Most retrieval benchmarks are designed as fixed reference points. While models and datasets continue to grow, evaluation is often repeated against the same static setup. 

We wanted something slightly different: a benchmark that functions not just as a fixed reference point, but as a reusable pipeline that can take on a new model or dataset in minutes, without recomputing work that has already been done.

In this blog, we discuss how we built that framework and what we observed when running it across three fundamentally different retrieval approaches on the ViDoRe benchmark. 

Problem

Visual document retrieval is the task of finding the right page in a document corpus given a text query, operating directly over raw page images with no OCR preprocessing. Although it resembles a retrieval problem, it is fundamentally a representation problem: how should a page image and a text query be encoded so that the relevant page ranks highest?

We evaluated three approaches, each grounded in different design philosophies. 

ColPali-v1.2 feeds the page image into PaliGemma, a vision-language model, and produces roughly 1024 embedding vectors per page (one per image patch). At query time, each query token is matched to its most relevant patch via a MaxSim operation. This late-interaction scoring allows the model to localize fine-grained regions of a page rather than compressing everything into a single representation.
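To make the late-interaction scoring concrete, here is a minimal PyTorch sketch of MaxSim between one query and one page; the tensor names and shapes are illustrative, not ColPali's actual implementation.

import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (n_query_tokens, dim) L2-normalized token embeddings
    # page_emb:  (n_patches, dim)      L2-normalized patch embeddings (~1024 per page)
    sim = query_emb @ page_emb.T              # similarity of every token vs. every patch
    # Each query token keeps only its best-matching patch; per-token maxima are summed.
    return sim.max(dim=1).values.sum()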

SigLIP SO400M takes the opposite approach: one global embedding per page, one for the query, scored with cosine similarity. The architecture is simple and computationally efficient, and competitive with patch-level models on many tasks. The tradeoff is that a single vector cannot localize specific regions within a page; if we search for a particular segment of a page, the dense global embedding may dilute the signal.
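For contrast, single-vector retrieval reduces to one dot product per page. A minimal sketch, assuming L2-normalized embeddings:

import torch

def cosine_rank(query_emb: torch.Tensor, page_embs: torch.Tensor, top_k: int = 5) -> torch.Tensor:
    # query_emb: (dim,), page_embs: (n_pages, dim), both L2-normalized
    scores = page_embs @ query_emb            # cosine similarity per page
    return scores.topk(top_k).indices         # indices of the best-scoring pages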

OCR + BM25 removes neural representations entirely. Pages are processed with doctr (a GPU-accelerated OCR engine), the extracted text is indexed, and queries are scored with BM25. This remains a strong baseline for text-dense documents, while being entirely insensitive to visual structure.
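A minimal sketch of the text side using the rank_bm25 package over whitespace-tokenized OCR output; the actual pipeline's tokenization and indexing details may differ.

from rank_bm25 import BM25Okapi

# Hypothetical OCR output per page (produced upstream by doctr)
page_texts = ["quarterly revenue grew 12 percent ...", "risk factors include ..."]
bm25 = BM25Okapi([text.lower().split() for text in page_texts])

query = "revenue growth"
scores = bm25.get_scores(query.lower().split())   # one BM25 score per page
ranked_pages = scores.argsort()[::-1]             # best-matching pages first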

The Pipeline

The framework runs on Flyte, a workflow orchestration system for data and ML pipelines. This matters less for the orchestration itself and more for two specific properties: deterministic task-level caching and the ability to keep GPU containers warm between invocations. Both turn out to be load-bearing for the design.

The framework consists of five stages: 

 Load → Index → Search → Evaluate → Report

Each stage caches its output against its inputs. If the pipeline is rerun with a different `top_k` value, the indexing stage is skipped entirely. GPU time is only spent on genuinely new computations. 

@colpali_indexer.task(cache="auto", retries=2)
async def index_colpali(page_ids: list[str], page_files: list[File]) -> File:
    ...

On the ViDoRe v3 finance subset (~2,900 pages), ColPali indexing requires substantial GPU time. When tuning hyperparameters or comparing additional models, re-encoding the full corpus each time would be prohibitively expensive. The pipeline avoids that cost by design via `cache="auto"`.

The same principle applies to data loading. Page images are uploaded to blob storage and referenced by ID. Subsequent re-runs reuse the metadata rather than transferring raw files again.
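A sketch of what such a cached loader might look like; the task environment, function name, and signature here are illustrative rather than the repository's actual code.

@loader.task(cache="auto")
async def load_pages(dataset: str) -> tuple[list[str], list[File]]:
    # Download the dataset once, upload each page image to blob storage,
    # and return stable page IDs plus File references. Because the task is
    # cached on its inputs, reruns with the same dataset reuse this metadata
    # instead of re-transferring raw images.
    ...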

To extend the framework to a new model, only two components must be implemented: 

  1. An encoder that maps page images to embeddings.
  2. A scorer that evaluates query embeddings against the indexed corpus.

The surrounding infrastructure, including caching, sharding, evaluation, and reporting, remains unchanged.
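Sketched as Python protocols, the two extension points might look roughly like this; names and signatures are illustrative, and the repository's actual interfaces may differ.

from typing import Protocol

import torch
from PIL import Image

class PageEncoder(Protocol):
    def encode_pages(self, images: list[Image.Image]) -> torch.Tensor:
        # Map a batch of page images to embeddings; the shape is model-specific
        # (one vector per page for SigLIP, ~1024 patch vectors per page for ColPali).
        ...

class QueryScorer(Protocol):
    def score(self, query_emb: torch.Tensor, index: torch.Tensor, top_k: int) -> list[int]:
        # Score one query's embedding(s) against the indexed corpus and
        # return the indices of the top_k pages.
        ...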

Effective GPU Utilization

Running multiple experiments concurrently often leads to poor GPU utilization. Suppose 8 concurrent search calls each trigger their own GPU forward pass: the GPU does not see 8x the work; it sees 8 small, poorly utilized batches in rapid succession.

The root cause is a lack of coordination across task invocations. Without aggregation, concurrent calls result in fragmented workloads. 

The solution is a `DynamicBatcher` – a process-level singleton that aggregates queries from concurrent callers into a shared queue. Batches are assembled according to a cost budget and timeout, then dispatched as a single GPU workload. 
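A simplified sketch of that aggregation loop, with the cost budget reduced to a maximum batch size and error handling omitted; the actual implementation in the repository is more involved.

import asyncio
from typing import Any, Awaitable, Callable

class DynamicBatcher:
    # Simplified sketch: aggregates items from concurrent callers into shared batches.
    def __init__(
        self,
        process_fn: Callable[[list[Any]], Awaitable[list[Any]]],
        max_batch: int = 32,          # stand-in for the real cost budget
        max_wait_s: float = 0.05,
    ):
        self._process_fn = process_fn
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def start(self) -> None:
        self._task = asyncio.create_task(self._loop())

    async def submit(self, item: Any) -> Any:
        # Callers park here until the batch containing their item has been processed.
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((item, fut))
        return await fut

    async def _loop(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await self._queue.get()
            batch, futs = [item], [fut]
            deadline = loop.time() + self._max_wait_s
            # Keep aggregating until the budget or the timeout is hit.
            while len(batch) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self._queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
                futs.append(fut)
            # One dispatch for the whole batch, e.g. a single GPU forward pass.
            results = await self._process_fn(batch)
            for f, r in zip(futs, results):
                f.set_result(r)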

The pattern is enabled by Flyte’s `ReusePolicy`, which keeps a single GPU container alive between task dispatches rather than cold starting a fresh container per invocation. Without it, each `search_colpali` call would land in a new process with a new batcher and there would be nothing to aggregate across. The singleton only works because the container is shared. 

colpali_indexer = flyte.TaskEnvironment(
    name="vidore-colpali-indexer",
    resources=flyte.Resources(cpu=4, memory="16Gi", gpu="A10G:1"),
    reusable=flyte.ReusePolicy(replicas=1, concurrency=8),
)
"""
With concurrency=8, up to 8 search shards dispatch concurrently onto the same warm container, all feeding the same batcher queue. The GPU sees one large workload, not eight small ones.
"""

Each `search_colpali` invocation calls `_get_colpali_search_batcher`, gets the same batcher instance, and submits its queries to it. The GPU receives curated batches designed for high utilization.

import asyncio

_colpali_batcher: DynamicBatcher | None = None
_colpali_batcher_lock = asyncio.Lock()

async def _get_colpali_search_batcher(index_file: File) -> DynamicBatcher:
    global _colpali_batcher
    if _colpali_batcher is not None:
        return _colpali_batcher
    async with _colpali_batcher_lock:
        # Double-checked: another caller may have created the batcher while we waited.
        if _colpali_batcher is not None:
            return _colpali_batcher
        # ... load model and index, create batcher
        await batcher.start()
        _colpali_batcher = batcher
    return _colpali_batcher

A subtle implementation detail is that the GPU forward pass must execute in a thread pool (`asyncio.to_thread`) rather than directly in the event loop. Running it inline would block the event loop during the forward pass, preventing the batcher from aggregating additional queries. By offloading GPU work to a separate thread, the aggregation loop remains responsive.

async def colpali_process_fn(batch: list[PageQuery]) -> list[list[str]]:
    def _gpu_work() -> list[list[str]]:
        # PyTorch forward pass here
        ...
    return await asyncio.to_thread(_gpu_work)

While one batch is executing on the GPU, the next is already being assembled. 

GPU Behavior Across Models

The three experiments produce meaningfully different GPU utilization profiles, reflecting their architectural differences as much as their accuracy differences. 

ColPali-v1.2

Figure 1a: GPU memory climbs to ~90% at model load and stays pinned throughout – the full patch index lives in GPU memory for the duration. Utilization oscillates between 60% and 100%, reflecting the continuous cost of MaxSim scoring across ~1024 embeddings per page.

OCR + BM25

Figure 1b: GPU memory stays near 10% throughout – there is no model index to hold. Utilization spikes sharply to 100% in short bursts as doctr processes document batches, then drops completely to zero between batches. Once OCR is complete, retrieval is entirely CPU-bound.

SigLIP SO400M

Figure 1c: The entire experiment completes in under 4 minutes. Memory climbs to ~25%, substantially less than ColPali, and utilization peaks briefly before settling at ~50% as the lightweight single-vector cosine scoring wraps up quickly.

Indexing

During indexing, there is only one caller, so batching across concurrent invocations is unnecessary. Instead, parallelism occurs between image download and GPU execution.

The implementation uses explicit prefetching. Before the GPU processes batch N, images for batch N+1 are downloaded in a thread pool. When the forward pass completes, the next batch’s inputs are ready.

prefetch = [loop.run_in_executor(None, _load_image_sync, f) for f in batches[0]]

for batch_idx in range(n_batches):
    images = list(await asyncio.gather(*prefetch))

    # Kick off downloads for the next batch before touching the GPU
    if batch_idx + 1 < n_batches:
        prefetch = [
            loop.run_in_executor(None, _load_image_sync, f)
            for f in batches[batch_idx + 1]
        ]

    # GPU forward pass — next batch is already downloading
    # (preprocessing of `images` into model `inputs` elided)
    with torch.no_grad():
        emb = model(**inputs)

The result is a steady pipeline: the GPU does not wait on I/O, and I/O does not wait on the GPU. 

Results 

We evaluated the pipeline on the ViDoRe v3 finance subset with 2,000 corpus pages and 1,854 queries.

Figure 2: Retrieval performance across four configurations on the ViDoRe v3 finance subset. ColPali dominates across all three metrics at K=5 and K=10.
Experiment      Model           Recall@K   NDCG@K   MRR     K
colpali-top10   colpali-v1.2    0.465      0.411    0.263   10
colpali-top5    colpali-v1.2    0.370      0.379    0.248   5
ocr-bm25-top5   ocr+bm25        0.142      0.097    0.082   5
siglip-top5     siglip-so400m   0.022      0.016    0.013   5
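For reference, a minimal sketch of how these metrics can be computed for a single query with binary relevance; the Evaluate stage aggregates them over all queries, and the repository's exact definitions may differ.

import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for page in ranked[:k] if page in relevant) / len(relevant)

def mrr(ranked: list[str], relevant: set[str]) -> float:
    for rank, page in enumerate(ranked, start=1):
        if page in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, page in enumerate(ranked[:k], start=1) if page in relevant)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0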

ColPali achieved the strongest results, which is consistent with the structure of financial documents. Patch-level representations can localize tables, figures, and dense numerical regions that global embeddings may smooth over.

The BM25 baseline is also informative. Despite having no neural components, it significantly outperformed SigLIP on Recall@5. Finance documents are text-dense and keyword overlap remains a strong signal in such settings. SigLIP’s global embedding struggled to compete without fine-tuning.

One takeaway is not that global embeddings are ineffective, but that document-heavy retrieval tasks may reward localized, token-aware representations.

Takeaways

It is worth pointing out that SigLIP, a neural model trained on hundreds of millions of image-text pairs and capable of producing far richer representations than any keyword index, scored a Recall@5 of 0.022, worse than BM25 by a factor of six. On financial documents, keyword overlap is a stronger signal than global semantic similarity. An architecture that wins on natural-image retrieval can fail badly in a document-heavy domain without fine-tuning.

That kind of finding is easy to miss when benchmarking is expensive. If running one model requires significant GPU time and manual setup, you run fewer experiments and rely more on intuition about which approach would work. The point of a reusable pipeline is that it lowers the cost of being wrong. You run the experiment instead of reasoning about it.

For the full implementation, see: https://github.com/unionai/unionai-examples/tree/main/v2/tutorials/multimodal-retrieval-evaluation

