Reranking in on-prem RAG: when it lifts relevance and when it just burns GPU
A reranker sharpens top-k ordering when queries are long and the corpus is dense. The numbers, the VRAM and latency cost, and five setups where it backfires.

Reranking in on-prem RAG: when it lifts relevance and when it just burns GPU
Reading time: about 9 min. Author: Fryderyk
A reranker lifts top-k relevance in RAG when the first-stage retriever returns many loosely ordered candidates and the queries are long and ambiguous. For short, factual queries over a well-chunked corpus, the gain often sits within the noise, while the cost is real: an extra model, extra latency, extra VRAM. The verdict is plain. Measure your retriever first, then add a cross-encoder. A reranker only partly compensates for weak retrieval, and never for free.
Below we break it down into numbers: what a reranker actually moves in the metrics, what it costs in latency and memory, and which setups make a naive cross-encoder backfire.
Contents
- Two retrieval stages, one reason rerankers exist
- What a reranker moves in the metrics (table)
- Cost: latency and VRAM (table)
- Why the naive setup backfires
- When a reranker pays off and when it does not
- What this post does not cover
- Disclosure and biases
- Related
Two retrieval stages, one reason rerankers exist
A classic RAG pipeline has two stages. The first stage (retriever) scans the whole corpus and returns a few dozen candidates. It is fast because it compares query and document vectors (dense) or matches terms (BM25, sparse). The second stage (reranker) takes that short list, scores each candidate together with the query in a single model pass, and reorders the list.
The difference is architectural. A dense retriever encodes query and document separately (bi-encoder), so document embeddings can be computed once and stored in the index. A reranker is a cross-encoder: it feeds query and document into the model together, so nothing can be precomputed. That is the whole economics of the decision. The reranker sees the token-by-token interaction between question and text that a bi-encoder cannot, but it pays with a model pass per pair.
The operational takeaway: a reranker makes sense as a narrow second sieve over a list the first stage already trimmed to a few dozen items. Running a cross-encoder over the entire corpus is a non-starter on cost.
What a reranker moves in the metrics
We usually track nDCG@10 (how well the top ten is ordered) and Recall@k (how many relevant documents made it into top-k at all). A reranker mainly improves ordering, that is nDCG, not Recall itself. If a relevant document never entered the first-stage candidate list, the reranker cannot conjure it back.
The values below are typical orders of magnitude we observed in tests on a corpus of technical documents in Polish and English. Your numbers will depend on domain and chunking quality, so treat them as direction, not promise.
| Setup | nDCG@10 | Recall@20 | Note |
|---|---|---|---|
| Dense alone (bi-encoder) | baseline | baseline | fast, weaker on ambiguous queries |
| BM25 alone | slightly lower on semantics | comparable on proper names | strong on codes, symbols, identifiers |
| Hybrid (dense plus BM25) | plus 5 to 10 pts | plus 5 to 12 pts | best gain-to-cost ratio |
| Hybrid plus reranker | extra plus 6 to 15 pts nDCG | no meaningful change | gain mostly in top-5 ordering |
The key reading: reranking adds to nDCG and barely touches Recall. If the problem is that relevant documents never enter the list, a reranker is the wrong tool. Then you work on chunking, the hybrid, and the first-stage k threshold.
Cost: latency and VRAM
A cross-encoder scores a query-document pair in a full model pass. Cost grows linearly with the number of candidates and with text length. That is exactly the variable easiest to miss in a demo (where k is small) and painful in production (where k tends to be large).
Below are orders of magnitude for a small or base class reranker, run on-prem on a single card. Numbers depend on chunk length and precision (fp16 vs int8), so they are indicative.
| Parameter | Reranker small (about 20 to 30 M) | Reranker base (about 100 to 300 M) |
|---|---|---|
| VRAM at fp16 | hundreds of MB | roughly 1 to 2 GB |
| Latency over 50 pairs, short chunks | tens of ms | tens to over 100 ms |
| Latency over 50 pairs, long chunks | grows 2 to 4 times | grows 2 to 4 times |
| Sensitivity to chunk length | high | high |
Two things worth remembering. First, latency scales with k times length, so controlling both numbers is cheaper than a bigger card. Second, a reranker usually fits in VRAM next to the embedding model, but on a single box shared with a generation LLM, memory contention is easy to hit. You plan for that up front, you do not discover it after deployment.
Why the naive setup backfires
The most common mistakes that turn a reranker from a gain into dead weight:
- Reranking too long a list. Running a cross-encoder over 200 candidates instead of 30 to 50 multiplies latency without proportional quality gain. The top positions rarely change below a certain k anyway.
- A reranker on top of a weak first stage. If first-stage Recall@k is low, the reranker neatly orders a list that is missing the right document. Looks better, answers worse.
- Chunks that are too long. A cross-encoder has a context limit and cost grows with length. Chunks near the limit get truncated (you lose signal) or are expensive (you pay for every token in every pair).
- No before-and-after measurement. Wiring in a reranker because it is the done thing, without offline eval on your own question set, is guesswork. Often the hybrid without a reranker delivers 80 percent of the gain at 20 percent of the cost.
- Query-side reranking with no cache. The same or very similar queries scored from scratch every time. A simple result cache at the query level is often cheaper than model optimization.
When a reranker pays off and when it does not
A reranker usually pays off when: queries are long and descriptive, the corpus is large and topically dense (many similar documents), and the cost of a wrong answer is high (technical documentation, procedures, compliance). Under those conditions, better top-5 ordering translates directly into LLM answer quality, because less noise reaches generation.
A reranker usually does not pay off when: queries are short and unambiguous, the corpus is small or well-chunked, and the latency budget is tight (an interactive assistant with a hard SLA on response time). There, a solid dense plus BM25 hybrid and work on chunking return more.
Rule of thumb: hybrid and good chunking first, measure nDCG and Recall, then switch on a reranker as a deliberate quality-for-latency trade-off, not as a default pipeline element.
What this post does not cover
- Specific reranking model names and their licenses (a separate review of open-weight model vendors).
- Fine-tuning a reranker on your own domain (a separate topic, a different risk and cost profile).
- Late interaction (for example multi-vector architectures) as an alternative to a classic cross-encoder.
- End-to-end evaluation of LLM answer quality (RAG eval), as distinct from retrieval quality alone.
- GPU sizing for LLM generation, which we cover separately.
// disclosure & biasesDisclosure and biases
We write from the perspective of on-prem deployments for regulated manufacturing, so we naturally emphasize cost control, predictable latency, and no dependence on an external API. The numbers in the tables are orders of magnitude from our tests on technical document corpora in PL and EN, not a published benchmark. Your domain, language, and chunking quality may yield different results. Treat this as a decision frame, not as ready values to copy into a business case.
Related
Building CortexMine, an on-prem AI platform for European manufacturers under NIS2. Where this bias could affect conclusions, it is flagged inline.
On-prem RAG: architecture, chunking, retrieval and what actually drives quality
How to build RAG outside the public cloud: the pipeline layers, the most common retrieval failures, the data boundary inside the prompt, and the questions an auditor will ask. A technical note for architects and CISOs.
GPU sizing for Llama 3.1 70B inference: numbers from benchmarks
How many GPUs does it really take to run Llama 3.1 70B in-house? Concrete configs (A100, H100, H200), the impact of quantization (FP16 → FP8 → INT4), tokens/s, TTFT, and cost per 1M tokens. No marketing — numbers from vLLM and TensorRT-LLM benchmarks.