
Reranking in RAG: the two-stage retrieval that lifts recall

If your RAG answers are vague or miss the obvious passage, the fastest win is usually not a better embedding model or a bigger context window. It is adding a second retrieval stage: retrieve broadly with a fast vector search, then rerank the top candidates with a cross-encoder before they reach the language model.

This is the "retrieve broadly, rank precisely" pattern. Here is why it works, what it costs, and how to size it for production.

Why one-stage retrieval leaves recall on the table

A standard RAG pipeline embeds the query, finds the nearest chunks by cosine similarity, and passes the top k to the language model. That bi-encoder search is fast because query and documents are embedded independently and compared with a cheap distance metric. The cost of that speed is precision: the bi-encoder never sees the query and a candidate passage together, so it ranks on coarse semantic similarity rather than actual relevance.
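As a minimal sketch of that one-stage search, assuming chunk embeddings are already computed: the three-dimensional vectors, the chunk ids, and the helper names `cosine` and `top_k` are illustrative stand-ins, not any particular library's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks whose embeddings are closest to the query.

    Every chunk vector was computed without seeing the query, so this
    ranking reflects embedding proximity and nothing more.
    """
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (hypothetical values, three dimensions for readability).
chunks = {
    "pricing-overview": [0.9, 0.1, 0.0],
    "refund-policy": [0.7, 0.6, 0.1],
    "release-notes": [0.0, 0.2, 0.9],
}
query = [0.8, 0.5, 0.0]
print(top_k(query, chunks, k=2))
```

In production the sort is replaced by an approximate nearest-neighbor index, but the shape of the computation is the same: one vector per side, one cheap comparison.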

The result is a familiar failure mode. The passage that answers the question is retrieved, but it sits at rank 8, below three loosely related chunks. If you only pass the top 5 to the model, the answer never arrives. Raising k to 20 floods the prompt with noise and, as we covered in the discussion of reducing hallucinations, more noise invites more invention.

What a reranker actually does

A cross-encoder reranker takes the query and one candidate passage together as a single input and outputs a relevance score. Because it attends to both at once, it captures fine-grained signal a bi-encoder cannot: negation, qualifiers, whether the passage answers this specific question rather than the general topic.

The two stages divide the labor. Stage one (vector or hybrid search) gives you recall: cast a wide net, retrieve 20 to 30 candidates cheaply. Stage two (the cross-encoder) gives you precision: re-score those candidates and keep the best 3 to 5. Neither stage alone is sufficient. The bi-encoder is too coarse to rank precisely; the cross-encoder is far too slow to run against your whole index.
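The division of labor can be sketched in a few lines. This is a toy under stated assumptions: `two_stage_retrieve` is a hypothetical helper, and the two scorers are passed in as plain functions with hard-coded scores standing in for a real vector search and a real cross-encoder.

```python
from typing import Callable, List, Sequence, Tuple

def two_stage_retrieve(
    query: str,
    corpus: Sequence[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    n_candidates: int = 20,
    k_final: int = 5,
) -> List[Tuple[str, float]]:
    # Stage one (recall): score everything with the cheap scorer, keep a wide net.
    candidates = sorted(
        corpus, key=lambda passage: cheap_score(query, passage), reverse=True
    )[:n_candidates]
    # Stage two (precision): run the expensive scorer only on the survivors.
    rescored = [(passage, precise_score(query, passage)) for passage in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:k_final]

# Hypothetical scores: the passage that answers the question is only third
# by the cheap scorer but first by the precise one.
cheap = {"topic overview": 0.81, "related changelog": 0.78,
         "exact answer": 0.74, "unrelated": 0.10}
precise = {"topic overview": 0.35, "related changelog": 0.20,
           "exact answer": 0.93, "unrelated": 0.01}

result = two_stage_retrieve(
    "example query",
    list(cheap),
    cheap_score=lambda q, p: cheap[p],
    precise_score=lambda q, p: precise[p],
    n_candidates=3,
    k_final=2,
)
print(result)  # [('exact answer', 0.93), ('topic overview', 0.35)]
```

Note that the expensive scorer runs `n_candidates` times per query, never once per document in the index, which is what keeps the second stage affordable.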

The numbers from 2026 benchmarks

Published 2026 retrieval benchmarks make the gain concrete. A hybrid retrieval stage followed by a neural reranker reached recall@5 of about 0.82, against 0.70 for hybrid fusion alone, 0.64 for BM25, and 0.59 for dense retrieval on its own. Ranking quality moved even more: mean reciprocal rank at 3 (MRR@3) rose from roughly 0.43 to 0.61. That MRR gain is what users feel as "it puts the right answer first."
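For readers who want to compute the same two metrics on their own data, here is a minimal sketch for a single query. It uses the common textbook definitions; the benchmarks quoted above may define or average them slightly differently, and the chunk ids are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant id within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

relevant = {"c9"}
before = ["c7", "c2", "c9", "c4"]   # right chunk retrieved, but ranked third
after = ["c9", "c7", "c2", "c4"]    # same chunks after reranking

print(recall_at_k(before, relevant, 5), mrr_at_k(before, relevant, 3))
print(recall_at_k(after, relevant, 5), mrr_at_k(after, relevant, 3))
```

Both orderings have the same recall@5 in this toy case, but the reciprocal rank moves from one third to one. Averaging these per-query values over an evaluation set gives the reported figures.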

Sizing the reranker for production

Two settings decide whether reranking helps or just adds latency.

How many candidates to rerank

The practical window is 10 to 30 candidates passed from stage one to the reranker. Below 10, you risk dropping a relevant chunk that the bi-encoder ranked low, which defeats the purpose. Above 30, cross-encoder latency grows roughly linearly while the marginal recall gain flattens. Start at 20 and tune against your eval set.
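One way to run that tuning is a small sweep over candidate counts. The sketch below is an assumption-laden harness, not a benchmark tool: `sweep_candidate_counts` is a hypothetical helper, `retrieve` and `rerank` are whatever callables wrap your two stages, and quality is measured as hit rate (at least one relevant chunk in the final top 5).

```python
import time

def sweep_candidate_counts(eval_set, retrieve, rerank, counts=(10, 20, 30), k_final=5):
    """Measure hit rate and average latency for each candidate count.

    eval_set: list of (query, set_of_relevant_ids)
    retrieve(query, n) -> list of ids from stage one, best first
    rerank(query, ids) -> the same ids reordered by the reranker
    """
    results = {}
    for n in counts:
        hits = 0
        started = time.perf_counter()
        for query, relevant in eval_set:
            final = rerank(query, retrieve(query, n))[:k_final]
            hits += bool(relevant & set(final))
        elapsed_ms = (time.perf_counter() - started) * 1000
        results[n] = {
            "hit_rate": hits / len(eval_set),
            "avg_ms": elapsed_ms / len(eval_set),  # both stages, end to end
        }
    return results

# Toy stand-ins: the gold chunk sits at stage-one rank 15, and the toy
# reranker moves it to the front whenever it is among the candidates.
stage_one_order = [f"c{i}" for i in range(14)] + ["gold"] + [f"c{i}" for i in range(14, 40)]
toy_retrieve = lambda query, n: stage_one_order[:n]
toy_rerank = lambda query, ids: sorted(ids, key=lambda cid: cid != "gold")

report = sweep_candidate_counts([("q1", {"gold"})], toy_retrieve, toy_rerank)
for n, row in report.items():
    print(n, row["hit_rate"])
```

With real components, the count to pick is the smallest one after which the hit rate stops improving while `avg_ms` keeps climbing.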

Managed versus self-hosted

For the lowest-friction path, a managed reranking API gets you a strong cross-encoder with one call and no GPU to operate. If licensing or data residency rules it out, an open-weight reranker self-hosted on a modest GPU is a well-trodden path. The decision mirrors other build-versus-buy infrastructure calls in AI; our vector database comparison walks through the same tradeoff for the storage layer.

Budget the latency

Reranking adds a synchronous step before generation, so it lands directly on time-to-first-token. Reranking 20 candidates typically adds tens of milliseconds with a managed API, more if you self-host on shared hardware. If your latency budget is tight, read our notes on inference latency and time to first token before you wire it in, and consider reranking only when stage-one confidence is low.
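Conditional reranking can be sketched as a simple gate. Everything specific here is an assumption: "confidence" is taken to mean the score margin between the top two stage-one results, the 0.05 threshold is a placeholder to tune on your own data, and `should_rerank` and `retrieve_with_gate` are hypothetical names.

```python
def should_rerank(stage_one_scores, min_margin=0.05):
    """Rerank when the top two stage-one scores are too close to call.

    stage_one_scores must be sorted best first.
    """
    if len(stage_one_scores) < 2:
        return False
    return (stage_one_scores[0] - stage_one_scores[1]) < min_margin

def retrieve_with_gate(query, candidates, rerank, k_final=5, min_margin=0.05):
    """candidates: list of (passage_id, score) from stage one, best first.

    rerank(query, ids) -> ids reordered; it is only called when the gate opens,
    so confident queries skip the extra latency.
    """
    ids = [passage_id for passage_id, _ in candidates]
    scores = [score for _, score in candidates]
    if should_rerank(scores, min_margin):
        ids = rerank(query, ids)
    return ids[:k_final]

# Toy reranker that just reverses the order, to make the gate visible.
toy_rerank = lambda query, ids: list(reversed(ids))

close_call = [("a", 0.82), ("b", 0.80), ("c", 0.50)]
clear_winner = [("a", 0.90), ("b", 0.60), ("c", 0.50)]
print(retrieve_with_gate("q", close_call, toy_rerank, k_final=2))
print(retrieve_with_gate("q", clear_winner, toy_rerank, k_final=2))
```

A top-two margin is only one possible signal. Whatever signal you choose, check on your eval set that the queries skipping the reranker are not the ones it would have fixed.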

When reranking is not the right fix

Reranking improves the ordering of candidates that retrieval already found. It cannot surface a passage that stage one never retrieved, and it cannot repair badly chunked documents where the answer is split across boundaries. If recall@30 is already poor, fix chunking and hybrid search first. And if your evals are not catching ranking regressions, no amount of reranking will tell you whether a change helped; the discipline of a fixed regression set still applies, as we argue in common RAG evaluation mistakes.
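A quick diagnostic makes this distinction measurable before you add a reranker. The sketch below assumes one gold chunk per eval query and a `stage_one` callable that returns ranked ids; `classify_misses` and the toy rankings are hypothetical.

```python
def classify_misses(eval_set, stage_one, pool=30, k_final=5):
    """Split eval queries by whether a reranker could help them.

    eval_set: list of (query, gold_id)
    stage_one(query, n) -> list of ids, best first
    """
    report = {"hit": 0, "ranking_miss": 0, "retrieval_miss": 0}
    for query, gold in eval_set:
        ranked = stage_one(query, pool)[:pool]
        if gold in ranked[:k_final]:
            report["hit"] += 1             # already in the final top k
        elif gold in ranked:
            report["ranking_miss"] += 1    # retrieved but ranked too low: rerankable
        else:
            report["retrieval_miss"] += 1  # never retrieved: fix chunking or search
    return report

# Toy stage-one rankings (hypothetical ids).
rankings = {
    "q1": ["g1", "x", "y"],
    "q2": ["a", "b", "c", "d", "e", "f", "g", "g2"],  # gold at rank 8
    "q3": ["a", "b", "c"],                            # gold missing entirely
}
eval_set = [("q1", "g1"), ("q2", "g2"), ("q3", "g3")]
report = classify_misses(eval_set, lambda query, n: rankings[query][:n])
print(report)  # {'hit': 1, 'ranking_miss': 1, 'retrieval_miss': 1}
```

If `retrieval_miss` dominates, the reranker has little to work with. If `ranking_miss` dominates, reranking is the right next step.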

Frequently asked questions

Does reranking replace hybrid search?

No, they stack. Hybrid search (dense plus BM25) improves what stage one retrieves; reranking improves how stage two orders it. The best 2026 numbers come from doing both, not choosing between them.

How much latency does a reranker add?

Reranking 20 candidates with a managed cross-encoder API usually adds tens of milliseconds. Self-hosting on shared hardware can add more. Because it sits before generation, budget it as part of time-to-first-token and rerank fewer candidates if the budget is tight.

How many candidates should I pass to the reranker?

Start at 20. The useful range is 10 to 30: fewer risks dropping relevant chunks, more adds latency with diminishing recall gains. Tune the exact number against your own evaluation set.

Will a better embedding model make reranking unnecessary?

Rarely. A stronger embedder improves stage-one recall, which is valuable, but bi-encoders still rank coarsely because they never compare query and passage together. The cross-encoder's joint scoring is a different mechanism, which is why the two stages compound.

Two-stage retrieval is one of the highest-leverage upgrades you can make to an existing RAG feature: a clear recall gain for a bounded latency cost. If you want it added to a pipeline you already run, see how we ship production AI work for US SaaS teams.
