Why pure vector search misses the queries that matter
Your retrieval-augmented generation demo answered every test question. Then a customer searched for an exact SKU, an error code, or a contract clause number, and the model confidently returned the wrong document. The problem was not bad embeddings. Dense vector search and keyword search fail in opposite directions, and a production retrieval system needs both.
The short answer, for anyone skimming:
- Dense embedding search captures meaning and paraphrase but loses exact terms, identifiers, and numbers that were rare or absent in the embedding model's training data.
- BM25 keyword search nails exact matches but misses synonyms and reworded questions.
- Hybrid search runs both retrievers, fuses the results (usually with reciprocal rank fusion), then reranks the top candidates.
- On the WANDS e-commerce benchmark, a tuned hybrid setup reached 0.7497 NDCG, about a 7.4 percent lift over the better of the two methods on its own.
Where pure vector search breaks
Dense retrieval turns a query and your documents into vectors and finds the nearest neighbors by cosine similarity. That is excellent when the user asks a question in different words than the source text. It is weak in three places that matter constantly in real products.
Exact identifiers and codes
Product IDs, SKUs, SSO error codes, API endpoint names, and version strings are often rare tokens. An embedding model that saw them rarely or never places them in fuzzy regions of the vector space, so a search for the error code PG-1142 can rank a vaguely related troubleshooting page above the exact page that names the code. BM25 treats that code as a literal term and goes straight to the right document.
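To see why a lexical scorer goes straight to the right page, here is a minimal sketch of Okapi BM25 in plain Python. The three documents, the meaning given to PG-1142, the tokenizer, and the parameter values k1=1.5 and b=0.75 are all assumptions chosen for illustration, not a tuned setup.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep hyphenated identifiers such as "pg-1142" as one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms (low document frequency) get a larger IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, invented for this example.
docs = [
    "General troubleshooting steps for login and connection errors.",
    "Error PG-1142: the connection pool is exhausted. Raise the pool size.",
    "How to reset your password after a failed login.",
]
scores = bm25_scores("what does error PG-1142 mean", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Only the second document contains the literal token pg-1142, so it is the only one with a nonzero score. The general troubleshooting page scores zero here because this simple tokenizer does not stem "errors" to "error", which is exactly the kind of literal behavior that helps with identifiers and hurts with paraphrases.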
Numbers and precise values
Embeddings compress meaning, and meaning is not the same as precision. The queries "plans over 500 seats" and "plans over 50 seats" sit close together in vector space even though they point at different rows in your pricing table. Lexical matching keeps the digits intact.
Domain jargon the model never learned
On financial documents, BM25 has been shown to outperform dense retrieval even when the dense side uses one of the strongest commercial embedding models available in 2026. Legal, medical, and internal-tooling corpora are full of terms an open-web embedding model underweights. If your retrieval quality looks fine on general questions and falls apart on the domain-specific ones, this is usually why.
How hybrid retrieval actually works
Hybrid search is not a single algorithm. It is a pattern: run a lexical retriever and a dense retriever in parallel, combine their ranked lists into one, and pass the merged top results to the generator, or to a reranker first.
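The "run both in parallel" part of the pattern can be sketched in a few lines. The function name hybrid_candidates and the two stub retrievers are hypothetical; in a real system they would wrap your lexical index and your vector store, and the two ranked lists would go on to a fusion step.

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_candidates(query, lexical_search, dense_search, top_k=50):
    """Run both retrievers concurrently and return their ranked ID lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        lexical = pool.submit(lexical_search, query, top_k)
        dense = pool.submit(dense_search, query, top_k)
        # .result() blocks until each retriever has answered.
        return lexical.result(), dense.result()

# Stub retrievers with hard-coded rankings, standing in for real backends.
def fake_lexical(query, top_k):
    return ["doc_pg1142", "doc_pool", "doc_login"][:top_k]

def fake_dense(query, top_k):
    return ["doc_troubleshoot", "doc_pg1142", "doc_pool"][:top_k]

lexical_ids, dense_ids = hybrid_candidates(
    "error PG-1142", fake_lexical, fake_dense, top_k=2
)
print(lexical_ids, dense_ids)
```

Threads are a reasonable fit when both retrievers are network calls, because total latency is then close to the slower of the two rather than their sum.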
Reciprocal rank fusion
The hard part of fusing two retrievers is that their scores are on incompatible scales. BM25 scores can run into double digits; cosine similarities sit between -1 and 1. Adding or averaging them directly lets one retriever silently dominate.
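A tiny worked example makes the scale problem concrete. The scores below are made up for illustration: document c is the dense retriever's clear favorite, yet a plain sum reproduces the BM25 ordering exactly.

```python
# Invented scores for three documents from each retriever.
bm25 = {"a": 14.2, "b": 11.9, "c": 0.0}
cosine = {"a": 0.31, "b": 0.42, "c": 0.88}

# Naive fusion: add the raw scores.
summed = {doc: bm25[doc] + cosine[doc] for doc in bm25}

order_by_sum = sorted(summed, key=summed.get, reverse=True)
order_by_bm25 = sorted(bm25, key=bm25.get, reverse=True)
order_by_cosine = sorted(cosine, key=cosine.get, reverse=True)

print(order_by_sum)     # identical to the BM25 ordering
print(order_by_cosine)  # the dense ranking has no visible effect on the sum
```

The dense scores would have to differ by more than two full points to change the outcome, which a cosine similarity cannot do.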
Reciprocal rank fusion (RRF) sidesteps this by ignoring the raw scores and using only the rank position. Each document gets a score of 1 divided by (k plus its rank) in each list, and the per-list scores are summed. A document that ranks near the top of both lists accumulates the highest combined score. Because RRF is scale-agnostic, you can add a third retriever later without re-tuning a weighting formula. A common starting value for the constant k is 60; treat it as a knob, not a law.
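The formula is short enough to write out. This is a minimal sketch that assumes 1-based ranks and the common starting value k=60; the document IDs and the two input rankings are invented for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs; ranks are 1-based."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); raw scores are ignored.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings from a lexical and a dense retriever.
bm25_ranking = ["doc_pg1142", "doc_pool", "doc_login"]
dense_ranking = ["doc_troubleshoot", "doc_pg1142", "doc_pool"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Here doc_pg1142 ranks first in one list and second in the other, so it scores 1/61 + 1/62 and leads the fused list, ahead of doc_troubleshoot, which tops the dense list but is missing from the lexical one. Adding a third retriever means passing a third list, with no weights to re-tune.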
Add a reranker on the top candidates
Fusion gets the right documents into the top 20 to 50. A cross-encoder reranker then reads each candidate together with the query and reorders them by true relevance, which is more accurate than either first-stage retriever because it attends to both texts at once. In reported tests, hybrid retrieval plus reranking reached MRR around 66 percent versus about 57 percent for semantic-only, roughly a 9-point gain. The cost is latency and dollars, so you rerank tens of candidates, not thousands. We cover that setup in our note on two-stage retrieval with rerankers.
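The shape of the reranking stage looks roughly like this. The scorer below, overlap_score, is a toy stand-in that counts shared words; it is not a cross-encoder, and a real system would swap in a model that reads the query and the passage together. The candidate texts and the function names are assumptions for illustration.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Reorder (doc_id, text) candidates by a pairwise relevance score."""
    scored = [(score_pair(query, text), doc_id) for doc_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]

def overlap_score(query, text):
    # Toy stand-in for a cross-encoder: fraction of query words found in the text.
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words)

# Fused candidates in first-stage order, invented for this example.
candidates = [
    ("doc_troubleshoot", "general troubleshooting for connection errors"),
    ("doc_pg1142", "error pg-1142 means the connection pool is exhausted"),
    ("doc_login", "reset your password after a failed login"),
]
top = rerank("error pg-1142 connection pool", candidates, overlap_score, top_n=2)
print(top)
```

Passing the scorer in as a function keeps the pipeline unchanged when you replace the toy with a real model, and top_n is the knob that bounds how many pairs you pay to score.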
The numbers worth quoting to your team
The case for hybrid is not vibes. On the WANDS e-commerce benchmark, a tuned hybrid configuration reached 0.7497 NDCG against 0.6983 for BM25 alone and 0.6953 for pure vector search, a 7.4 percent lift over the better of the two. The pattern repeats across domains: the two retrievers fail on different queries, so combining them recovers recall that either one drops.
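If you quote the lift, it helps to show the arithmetic. This uses only the three NDCG figures above and computes relative lift as hybrid divided by baseline, minus one.

```python
hybrid, bm25, dense = 0.7497, 0.6983, 0.6953

# Relative lift in percent over each single-retriever baseline.
lift_over_bm25 = (hybrid / bm25 - 1) * 100
lift_over_dense = (hybrid / dense - 1) * 100

print(f"vs BM25:  {lift_over_bm25:.1f}%")   # the stronger baseline
print(f"vs dense: {lift_over_dense:.1f}%")
```

The 7.4 percent figure is the lift over BM25, the stronger baseline; against pure vector search the same hybrid score works out to about 7.8 percent.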
What does that buy you in product terms? Fewer answers grounded in the wrong chunk, fewer not-found responses on queries that contain an exact term, and a measurable drop in the hallucinations that come from feeding the model an irrelevant passage. If you want the failure taxonomy, see the RAG evaluation mistakes teams make when they only test the happy path.
When to reach for hybrid, and when not
Hybrid is the right default for most production RAG over mixed content: docs with code, support tickets, catalogs, contracts, and anything with identifiers. It is overkill if your corpus is short, purely conversational prose with no exact terms, and your latency budget is tight, since you are now running two retrievers and a reranker per query.
There is no universal recipe for the BM25-to-dense mix or the reranking depth. Legal, code, and customer-support corpora each want different tuning, and you find the setting by measuring recall, latency, and cost on your own queries rather than copying a blog default. Before you tune retrieval at all, make sure your chunking is sound, because fusion cannot rescue a chunk that split the answer in half. Our guide to chunking strategies for retrieval quality is the place to start, and the broader 2026 production RAG architecture guide shows where retrieval fits in the full pipeline.
One infrastructure note: hybrid search needs a store that does both lexical and vector retrieval, or two stores you query in parallel. Several databases now handle both in one engine; our vector database comparison walks through the tradeoffs for pgvector, Pinecone, and Qdrant.
How we ship this for SaaS teams
When a team hands us an AI feature with a retrieval quality problem, the first thing we do is build an eval set from real user queries, including the exact-term queries that break vector search. Then we add BM25 alongside the existing embeddings, fuse with RRF, and rerank the top candidates. The change is usually a few days of work and shows up as a recall number going up, not a redesign. The point is to measure first, fuse second, and only then argue about which embedding model to use.
Frequently asked questions
Is hybrid search always better than pure vector search?
Not always, but it is rarely worse on real corpora. The lift is largest when your documents contain exact terms, identifiers, or numbers that users search for directly. On purely conversational text with no exact terms, the gap narrows and the extra latency may not be worth it.
What is reciprocal rank fusion and why use it over weighted scores?
RRF combines ranked lists using each document's rank position rather than its raw score, scoring it as 1 over (k plus rank) and summing across lists. Because BM25 scores and cosine similarities live on different scales, adding them directly, even with weights, can let one retriever dominate unless you normalize the scores first. RRF avoids that and lets you add retrievers without re-tuning.
Do I still need a reranker if I use hybrid search?
Fusion gets the right documents into the top candidates; a reranker reorders those candidates by true relevance and typically adds several points of MRR. It is optional but usually worth it, and you only run it on the top tens of results to keep latency and cost bounded.
How do I know hybrid search improved anything?
Measure retrieval metrics such as recall at k and NDCG on a labeled set of your own queries before and after, and watch end-to-end answer quality. If you cannot see the number move, you cannot defend the added complexity.
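Both metrics are a few lines each. This is a minimal sketch for a single query, assuming graded relevance labels and the linear-gain form of DCG; the labels and the before and after rankings are invented. In practice you would average each metric over your whole labeled query set.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k):
    """gains maps doc ID to graded relevance; missing IDs count as 0."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical labels for one query, and rankings before and after a change.
gains = {"doc_pg1142": 2, "doc_pool": 1}
before = ["doc_troubleshoot", "doc_pool", "doc_login"]
after = ["doc_pg1142", "doc_pool", "doc_troubleshoot"]

for name, ranking in (("before", before), ("after", after)):
    recall = recall_at_k(ranking, gains.keys(), k=3)
    ndcg = ndcg_at_k(ranking, gains, k=3)
    print(f"{name}: recall@3={recall:.2f} ndcg@3={ndcg:.3f}")
```

Recall tells you whether the right documents made it into the candidate set at all; NDCG also rewards putting the most relevant one first, which is what the reranker is supposed to improve.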
Would you rather we just build it?
Book a free scoping call and we'll ship your production-safe AI feature this week.