Production RAG Pipeline: Chunking, Re-ranking, Latency

2024-01-12

Build a production RAG pipeline in 2024: chunking, 40% less retrieval noise, cross-encoder re-ranking, and latency budgets that keep p95 under 2 seconds.

Frequently Asked Questions

A bi-encoder encodes the query and each document independently into vectors, so document embeddings can be computed once and stored, making retrieval over millions of items fast through approximate nearest-neighbour search. A cross-encoder passes the query and a candidate document through the model together and outputs a single relevance score, which is more accurate but far too slow to run over a whole corpus. Production pipelines use the bi-encoder to fetch a shortlist of around one hundred candidates, then the cross-encoder to re-rank only that shortlist.
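The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not a real model setup: `embed` and `joint_score` are toy stand-ins (assumptions for this example) for a bi-encoder and a cross-encoder, and the function name `retrieve_then_rerank` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a bi-encoder: a bag-of-words count vector.
    A real system would call an embedding model here."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def joint_score(query, doc):
    """Toy stand-in for a cross-encoder: it sees query and document together.
    Here it counts query bigrams that appear verbatim in the document."""
    q = query.lower().split()
    d = doc.lower()
    return sum(1 for i in range(len(q) - 1) if f"{q[i]} {q[i + 1]}" in d)

def retrieve_then_rerank(query, docs, doc_vecs, shortlist_size=100, top_k=3):
    q_vec = embed(query)
    # Stage 1: cheap similarity against document vectors computed ahead of time.
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)
    shortlist = ranked[:shortlist_size]
    # Stage 2: the expensive joint scorer runs only on the shortlist.
    reranked = sorted(shortlist, key=lambda i: joint_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:top_k]]

docs = [
    "rate limits reset every minute for the api",
    "the api reset rate limits documentation page limits rate",
    "how to bake bread at home",
]
doc_vecs = [embed(d) for d in docs]  # computed once and stored, as with real embeddings

best = retrieve_then_rerank("rate limits reset", docs, doc_vecs, shortlist_size=2, top_k=1)
```

In this toy data the first stage ranks the second document highest because it repeats the query words, while the joint scorer promotes the first document because it contains the query phrasing in order. That is the kind of correction the re-ranking stage is there to make.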

Chunks that are too large dilute the embedding with mixed topics, so a relevant passage is averaged out and ranked poorly, while chunks that are too small lose the surrounding context the model needs to answer. Most production systems land between roughly two hundred and five hundred tokens per chunk with a small overlap so a sentence split across a boundary still appears whole in one chunk. The right size depends on document structure and is set empirically against an evaluation set rather than guessed.
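A sliding window with overlap is one simple way to produce chunks like these. The sketch below assumes the text is already tokenized into a list; the function name `chunk_tokens` and the default sizes are illustrative choices within the range mentioned above, not recommended settings.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=30):
    """Split a token list into fixed-size windows that share `overlap` tokens.
    The defaults are illustrative; tune them against an evaluation set."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Small numbers make the overlap visible: each chunk repeats the last token of the previous one.
example = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
```

A fixed token overlap only approximates keeping sentences whole: a sentence longer than the overlap can still be cut, so the overlap size is one more parameter to check against the evaluation set.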

Latency accumulates across four stages: embedding the query, the vector search itself, the cross-encoder re-ranking step, and the generation call to the language model. Re-ranking adds real time because the cross-encoder runs one forward pass per candidate, so the shortlist size is a direct latency lever. Generation is typically the largest single component, which is why streaming the model output and keeping the retrieved context tight matter most for perceived speed.
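The arithmetic behind a latency budget can be made explicit with a small helper. Everything numeric here is a made-up placeholder rather than a measurement: the per-stage timings, the per-candidate re-ranking cost, and the function name `latency_budget` are assumptions for illustration. The 2000 ms default mirrors the 2-second target mentioned at the top of this article.

```python
def latency_budget(embed_ms, search_ms, rerank_ms_per_candidate,
                   shortlist_size, generation_ms, budget_ms=2000):
    """Sum per-stage latencies and check the total against a budget.
    Re-ranking is modelled as one forward pass per shortlisted candidate."""
    stages = {
        "embed_query": embed_ms,
        "vector_search": search_ms,
        "rerank": rerank_ms_per_candidate * shortlist_size,
        "generation": generation_ms,
    }
    total = sum(stages.values())
    return stages, total, total <= budget_ms

# Hypothetical timings: the same pipeline with a shortlist of 100 versus 200 candidates.
stages_100, total_100, ok_100 = latency_budget(20, 50, 4, 100, 1200)
stages_200, total_200, ok_200 = latency_budget(20, 50, 4, 200, 1200)
```

With these placeholder numbers, doubling the shortlist from 100 to 200 moves the total from 1670 ms to 2070 ms and past the budget, which shows why shortlist size is treated as a lever. Real budgets would be set from measured percentiles per stage, not from a simple sum of typical values.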