Production RAG Pipeline: Chunking, Re-ranking, Latency

Production RAG pipeline design: chunking strategies, bi-encoder retrieval, cross-encoder re-ranking, and latency budget tactics for enterprise AI in 2024.
Frequently Asked Questions
- A bi-encoder encodes the query and each document independently into vectors, so document embeddings can be computed once and stored, making retrieval over millions of items fast through approximate nearest-neighbour search. A cross-encoder passes the query and a candidate document through the model together and outputs a single relevance score, which is more accurate but far too slow to run over a whole corpus. Production pipelines use the bi-encoder to fetch a shortlist of around one hundred candidates, then the cross-encoder to re-rank only that shortlist.
- Chunks that are too large dilute the embedding with mixed topics, so a relevant passage is averaged out and ranked poorly, while chunks that are too small lose the surrounding context the model needs to answer. Most production systems land between roughly two hundred and five hundred tokens per chunk with a small overlap so a sentence split across a boundary still appears whole in one chunk. The right size depends on document structure and is set empirically against an evaluation set rather than guessed.
- Latency accumulates across four stages: embedding the query, the vector search, the cross-encoder re-ranking step, and the generation call to the language model. Re-ranking adds real time because the cross-encoder runs one forward pass per candidate, so the shortlist size is a direct latency lever. Generation is typically the largest single component, which is why streaming the model output and keeping the retrieved context tight matter most for perceived speed.
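The two-stage flow in the first answer above (bi-encoder shortlist, then cross-encoder re-rank) can be sketched as follows. This is a minimal illustration and not a production implementation. `embed` is a toy bag-of-words stand-in for a real bi-encoder, `joint_score` is a toy stand-in for a real cross-encoder, and the brute-force scan stands in for approximate nearest-neighbour search. All function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a bi-encoder: encodes one text on its own."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def joint_score(query, doc):
    """Toy stand-in for a cross-encoder: scores query and doc together.

    Counts shared terms and gives extra weight to query word pairs that
    appear in the same order in the document.
    """
    q = query.lower().split()
    d = doc.lower().split()
    shared = sum(1 for t in q if t in d)
    doc_bigrams = set(zip(d, d[1:]))
    ordered = sum(1 for pair in zip(q, q[1:]) if pair in doc_bigrams)
    return shared + 2 * ordered


def retrieve_then_rerank(query, docs, doc_vectors, shortlist_size=100, top_k=3):
    """Stage 1: cheap vector shortlist. Stage 2: joint re-rank of the shortlist."""
    query_vector = embed(query)
    # Brute-force scan here; a real system would use an ANN index instead.
    by_similarity = sorted(
        range(len(docs)),
        key=lambda i: cosine(query_vector, doc_vectors[i]),
        reverse=True,
    )
    shortlist = by_similarity[:shortlist_size]
    # One joint scoring call per shortlisted candidate only.
    reranked = sorted(
        shortlist, key=lambda i: joint_score(query, docs[i]), reverse=True
    )
    return [docs[i] for i in reranked[:top_k]]


docs = [
    "the cat sat on the mat",
    "vector search latency budget",
    "latency of search in a vector budget plan",
]
# Document vectors are computed once, ahead of query time.
doc_vectors = [embed(d) for d in docs]
best = retrieve_then_rerank(
    "vector search latency", docs, doc_vectors, shortlist_size=2, top_k=1
)
```

The point of the split is that `embed` runs once per document ahead of time, while `joint_score` runs only `shortlist_size` times per query.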
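The fixed-size chunking with overlap described in the second answer above can be sketched like this. It is a minimal sketch that assumes the text has already been tokenized into a list. The function name `chunk_tokens` and the defaults of 300 tokens with a 30-token overlap are illustrative assumptions, and a real system would tune both against an evaluation set.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=30):
    """Split a token list into fixed-size chunks that share `overlap` tokens.

    Each chunk starts `chunk_size - overlap` tokens after the previous one,
    so text near a boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a trailing
        # chunk that only repeats the overlap region.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Tiny example with integers standing in for tokens.
example = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
```

Token-count overlap only approximates the goal of keeping a sentence whole. Splitting on sentence or section boundaries first is a common refinement when document structure allows it.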
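The latency arithmetic in the third answer above can be made concrete with a small budget calculator. This is a rough sketch that assumes the stages run sequentially and that re-ranking cost grows linearly with shortlist size, one forward pass per candidate with no batching. The function names are hypothetical and the numbers in the example are placeholders, not measurements.

```python
def estimate_latency_ms(embed_ms, search_ms, rerank_ms_per_candidate,
                        shortlist_size, generation_ms):
    """Return per-stage and total latency for one sequential request."""
    stages = {
        "embed_query": embed_ms,
        "vector_search": search_ms,
        # One cross-encoder forward pass per shortlisted candidate.
        "rerank": rerank_ms_per_candidate * shortlist_size,
        "generation": generation_ms,
    }
    stages["total"] = sum(stages.values())
    return stages


def max_shortlist_size(budget_ms, embed_ms, search_ms,
                       rerank_ms_per_candidate, generation_ms):
    """Largest shortlist that still fits the overall latency budget."""
    remaining = budget_ms - embed_ms - search_ms - generation_ms
    if remaining <= 0:
        return 0
    return int(remaining // rerank_ms_per_candidate)


# Placeholder figures for illustration only.
breakdown = estimate_latency_ms(
    embed_ms=10, search_ms=20, rerank_ms_per_candidate=2,
    shortlist_size=100, generation_ms=800,
)
largest = max_shortlist_size(
    budget_ms=1000, embed_ms=10, search_ms=20,
    rerank_ms_per_candidate=2, generation_ms=800,
)
```

Batching candidates through the cross-encoder breaks the linear assumption, so measured per-stage timings should replace these inputs before the result is used to pick a shortlist size.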
















