
How to reduce LLM hallucinations in production RAG

A retrieval-augmented chatbot can pass every demo and still invent a refund policy the day a customer asks. Hallucinations are rarely a model problem you can prompt away. In production they are a pipeline problem: the wrong context arrives, the model fills the gap, and nothing in the path checks whether the answer is grounded in the retrieved source.

This is a practitioner walkthrough of where hallucinations actually enter a RAG system and the controls that cut them, in the order we add them when we ship a feature.

Most production hallucinations start in retrieval, not generation

When a model confidently states something false, the instinct is to blame the model. In our experience shipping RAG features, the more common cause is that the retriever handed the model thin or off-topic context, and the model did what language models do: it produced a fluent, plausible completion to fill the hole.

That reframes the fix. Before touching the prompt, measure retrieval. Track recall@k on a small labelled set of real questions: for each question, does the chunk that actually answers it appear in the top k passed to the model? If recall@5 sits at 0.6, then for four questions in ten the answering chunk never reaches the model, and no prompt instruction will save those answers. We cover the retrieval mechanics in depth in our production RAG architecture guide.
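As a minimal sketch of that measurement, assuming each labelled question has exactly one answering chunk and that your retriever returns ranked chunk ids: the ids and the `labelled` and `results` dictionaries below are made up for illustration.

```python
from typing import Dict, List


def recall_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, str],
    k: int = 5,
) -> float:
    """Fraction of questions whose answering chunk appears in the top k results."""
    if not relevant:
        raise ValueError("labelled set is empty")
    hits = 0
    for question_id, gold_chunk_id in relevant.items():
        top_k = retrieved.get(question_id, [])[:k]
        if gold_chunk_id in top_k:
            hits += 1
    return hits / len(relevant)


# Hypothetical labelled set: question id -> id of the chunk that answers it.
labelled = {"q1": "c7", "q2": "c3", "q3": "c9", "q4": "c1", "q5": "c4"}

# Hypothetical retriever output: question id -> ranked chunk ids.
results = {
    "q1": ["c7", "c2", "c5"],
    "q2": ["c8", "c3", "c6"],
    "q3": ["c2", "c5", "c6"],  # answering chunk missing
    "q4": ["c1"],
    "q5": ["c6", "c2", "c8"],  # answering chunk missing
}

score = recall_at_k(results, labelled, k=5)
print(f"recall@5 = {score:.2f}")
```

Here three of five questions have their answering chunk in the top five, so the script reports 0.60. If a question can be answered by more than one chunk, store a set of acceptable ids per question and count a hit when any of them appears.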

Fix retrieval quality first

Two changes move recall the most. The first is hybrid search: combine dense embeddings with sparse keyword matching (BM25) so exact terms, product names, and error codes are not lost in vector space. The second is a reranking stage that re-scores the top candidates with a cross-encoder before they reach the model. In published 2026 benchmarks, a hybrid-plus-rerank pipeline reached recall@5 of about 0.82 versus 0.59 for dense retrieval alone. Higher recall means the grounding the model needs is more often in front of it.
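One common way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion. That is our choice for this sketch, not the only option, and the document ids and the constant `k=60` are illustrative assumptions. The reranking stage is not shown: in practice a cross-encoder would re-score the fused top candidates.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists; each list adds 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs from the two retrievers for one query.
dense_ranking = ["a", "b", "c"]   # embedding similarity order
sparse_ranking = ["c", "a", "d"]  # BM25 order

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents that both retrievers rank highly ("a" and "c" here) rise above documents only one retriever found, which is the behavior you want before handing a short list to the reranker.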

Make grounding the default, not a suggestion

Once the right context is retrieved, the generation step still has to use it. Three controls keep the model inside the source material.

Require citations per claim

Instruct the model to attach the source passage id to each factual sentence and to answer only from the supplied context. This does two things: it nudges the model toward extraction over invention, and it gives you a machine-checkable signal. If a sentence cites nothing, you can flag or drop it before it reaches the user.
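A rough sketch of the machine check, assuming the model was told to cite passages as `[p1]`, `[p2]` and so on before the closing punctuation. That citation format and the naive sentence splitter are assumptions for illustration.

```python
import re
from typing import List, Set, Tuple

# Assumed citation format: [p<number>], e.g. [p2].
CITATION = re.compile(r"\[(p\d+)\]")


def split_sentences(answer: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def check_citations(answer: str, valid_ids: Set[str]) -> List[Tuple[str, bool]]:
    """Pair each sentence with True only if it cites at least one retrieved passage
    and every id it cites was actually retrieved."""
    checked = []
    for sentence in split_sentences(answer):
        cited = set(CITATION.findall(sentence))
        ok = bool(cited) and cited <= valid_ids
        checked.append((sentence, ok))
    return checked


answer = (
    "Refunds are available within 30 days [p2]. "
    "Shipping is always free. "
    "Returns need a receipt [p9]."
)
retrieved_ids = {"p1", "p2", "p3"}

checked = check_citations(answer, retrieved_ids)
flagged = [sentence for sentence, ok in checked if not ok]
print(flagged)
```

The second sentence is flagged because it cites nothing, and the third because it cites a passage that was never retrieved. A valid citation only shows the model pointed at a source, not that the source supports the claim, which is what the verification pass later in this post is for.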

Add contextual compression

Passing ten full documents dilutes the signal and invites the model to pattern-match on the wrong passage. Compress retrieved context to the sentences that actually bear on the question before generation. Less noise in, fewer confident detours out.
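To make the idea concrete, here is a deliberately crude sketch that keeps only sentences sharing a term with the question. Lexical overlap is a stand-in we chose for brevity; a production compressor would more likely score sentences with embeddings or a small extraction model. The stopword list and example passages are invented.

```python
import re
from typing import List, Set

# Tiny illustrative stopword list, not a complete one.
STOPWORDS = {
    "the", "a", "an", "is", "are", "of", "to", "in", "for",
    "what", "how", "do", "i", "my", "on", "and",
}


def terms(text: str) -> Set[str]:
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def compress(question: str, passages: List[str], min_overlap: int = 1) -> List[str]:
    """Keep sentences that share at least min_overlap terms with the question."""
    q_terms = terms(question)
    kept = []
    for passage in passages:
        for sentence in re.split(r"(?<=[.!?])\s+", passage):
            if len(terms(sentence) & q_terms) >= min_overlap:
                kept.append(sentence.strip())
    return kept


question = "What is the refund window for annual plans?"
passages = [
    "Our office is in Austin. Annual plans can be refunded within 30 days.",
    "Support hours are 9 to 5. The refund window for monthly plans is 14 days.",
]

compressed = compress(question, passages)
print(compressed)
```

The office address and support hours are dropped, and both refund sentences survive. Note that the monthly-plan sentence is still there: compression narrows the context, it does not decide which passage is correct.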

Give the model permission to say "I do not know"

Many hallucinations are a failure to abstain. If the prompt and the few-shot examples never show an "insufficient context" answer, the model assumes one is never acceptable. Show it that abstaining is a valid, rewarded response when the retrieved context does not contain the answer.
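One way to do that is to put an abstaining example directly in the few-shot set. The sketch below assumes a sentinel string, `INSUFFICIENT_CONTEXT`, that the model is asked to return verbatim. The sentinel, the prompt wording, and the examples are all hypothetical.

```python
from typing import List

# Hypothetical sentinel the model returns when the context lacks the answer.
ABSTAIN = "INSUFFICIENT_CONTEXT"

FEW_SHOT = [
    {
        "context": "[p1] Annual plans can be refunded within 30 days.",
        "question": "Can I get a refund on an annual plan?",
        "answer": "Yes, within 30 days [p1].",
    },
    {
        # Same context, unanswerable question: the example shows abstaining.
        "context": "[p1] Annual plans can be refunded within 30 days.",
        "question": "Do you offer student discounts?",
        "answer": ABSTAIN,
    },
]


def build_prompt(question: str, passages: List[str]) -> str:
    lines = [
        "Answer only from the context. Cite the passage id for each claim.",
        f"If the context does not contain the answer, reply exactly: {ABSTAIN}",
        "",
    ]
    for shot in FEW_SHOT:
        lines += [
            f"Context:\n{shot['context']}",
            f"Question: {shot['question']}",
            f"Answer: {shot['answer']}",
            "",
        ]
    context = "\n".join(f"[p{i}] {p}" for i, p in enumerate(passages, start=1))
    lines += [f"Context:\n{context}", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


def is_abstention(model_output: str) -> bool:
    return model_output.strip() == ABSTAIN


prompt = build_prompt("Is there a free trial?", ["Support is available by email."])
print(prompt)
```

A fixed sentinel makes abstentions easy to detect in code, so the application can swap in a friendlier message or a fallback instead of showing the raw string.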

Verify the answer before the user sees it

Grounded prompting reduces hallucinations; it does not eliminate them. For anything customer-facing or regulated, add a verification pass between generation and display.

A grounding verifier is a second, cheaper model call that takes the answer plus the retrieved context and classifies each claim as supported, partly supported, or unsupported. Unsupported claims get stripped, or the whole answer is regenerated with a stricter instruction. This is the same separation of concerns we apply to RAG evaluation: the system that writes the answer should not be the only system that judges it.
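The control flow around the verifier can be sketched independently of which model does the judging. Below, `judge` is any callable that labels one claim against the context. The `substring_judge` is only a stand-in so the example runs, and the `max_unsupported` cutoff is an assumption, not a recommendation.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    SUPPORTED = "supported"
    PARTLY = "partly_supported"
    UNSUPPORTED = "unsupported"


# In production this would wrap the second, cheaper model call.
Judge = Callable[[str, str], Verdict]


def verify_answer(
    claims: List[str],
    context: str,
    judge: Judge,
    max_unsupported: int = 1,
) -> Tuple[List[str], bool]:
    """Return (claims to keep, needs_regeneration)."""
    verdicts = [(claim, judge(claim, context)) for claim in claims]
    unsupported = [c for c, v in verdicts if v is Verdict.UNSUPPORTED]
    if len(unsupported) > max_unsupported:
        # Too much is ungrounded: discard and regenerate with a stricter prompt.
        return [], True
    kept = [c for c, v in verdicts if v is not Verdict.UNSUPPORTED]
    return kept, False


def substring_judge(claim: str, context: str) -> Verdict:
    # Stand-in for the model call: exact containment only, never returns PARTLY.
    found = claim.lower().rstrip(".") in context.lower()
    return Verdict.SUPPORTED if found else Verdict.UNSUPPORTED


context = "Annual plans can be refunded within 30 days. Support is available by email."
claims = ["Annual plans can be refunded within 30 days.", "Shipping is free."]

kept, regenerate = verify_answer(claims, context, substring_judge)
print(kept, regenerate)
```

This sketch keeps partly supported claims. Whether to keep, soften, or strip them is a product decision that depends on how costly a wrong answer is in your domain.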

Confidence thresholds add a second gate. When the verifier's verdicts or the retriever's scores fall below a calibrated threshold, route to a fallback: a narrower canned answer, a human handoff, or a request for clarification. A system that gracefully admits uncertainty on five percent of queries beats one that confidently makes things up on the same five percent.
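A minimal routing sketch, assuming you have a top retrieval score and the share of claims the verifier marked as supported. The threshold values and route names below are placeholders; real values should be calibrated on your own regression set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values for illustration only; calibrate on real traffic.
    min_retrieval_score: float = 0.35
    min_support_ratio: float = 0.9


def route(
    top_retrieval_score: float,
    support_ratio: float,
    thresholds: Thresholds = Thresholds(),
) -> str:
    """Pick a path: 'answer', 'clarify', or 'handoff'."""
    if top_retrieval_score < thresholds.min_retrieval_score:
        # Nothing relevant was retrieved: ask the user to rephrase or narrow.
        return "clarify"
    if support_ratio < thresholds.min_support_ratio:
        # Context was found but the answer is not well grounded in it.
        return "handoff"
    return "answer"


print(route(top_retrieval_score=0.8, support_ratio=1.0))
print(route(top_retrieval_score=0.2, support_ratio=1.0))
print(route(top_retrieval_score=0.8, support_ratio=0.5))
```

Checking the retrieval score first means a weak retrieval never reaches the more expensive verifier-based decision.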

Close the loop with evals, not vibes

You cannot improve what you only notice anecdotally. Maintain a regression set of real questions with known-good answers and a few known traps where the right response is "not enough information." Run it on every prompt, model, or retrieval change. This is the difference between knowing a change reduced hallucinations and hoping it did. The same discipline separates a demo from a product, which is the theme of why RAG alone is not enough for most real features.
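A regression run can start as small as the sketch below. It assumes each case carries an expected substring, that trap cases expect an abstention phrase, and that `answer_fn` wraps your whole pipeline. The cases and the `fake_pipeline` are invented for the example, and substring matching is a simplification of real answer grading.

```python
from typing import Callable, Dict, List

ABSTAIN = "not enough information"


def run_regression(cases: List[dict], answer_fn: Callable[[str], str]) -> Dict[str, float]:
    """Score a pipeline on known-good answers and on traps that should abstain."""
    if not cases:
        raise ValueError("regression set is empty")
    passed = trap_total = trap_passed = 0
    for case in cases:
        output = answer_fn(case["question"]).lower()
        ok = case["expected"].lower() in output
        passed += ok
        if case.get("trap", False):
            trap_total += 1
            trap_passed += ok
    return {
        "pass_rate": passed / len(cases),
        "trap_pass_rate": trap_passed / trap_total if trap_total else 1.0,
    }


# Hypothetical regression set: one known-good answer and one trap.
cases = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Do you ship to Mars?", "expected": ABSTAIN, "trap": True},
]


def fake_pipeline(question: str) -> str:
    # Stand-in for the real RAG pipeline; it never abstains.
    if "refund" in question.lower():
        return "Refunds are accepted within 30 days."
    return "Yes, we ship everywhere."


report = run_regression(cases, fake_pipeline)
print(report)
```

Reporting the trap pass rate separately matters: a pipeline can score well on answerable questions while failing every case where the right response is to abstain.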

If you want to compare the grounding tradeoffs of RAG against tuning the model itself, our breakdown of RAG versus fine-tuning walks through when each approach earns its keep.

Frequently asked questions

Can prompt engineering alone stop hallucinations?

No. Prompting helps the model use the context it receives, but if retrieval delivers the wrong context, no instruction recovers a fact that was never present. Fix retrieval recall first, then tighten the prompt.

What is a realistic hallucination rate for a production RAG feature?

It depends on the domain and how strict your verifier is, but the goal is not zero at any cost. A practical target is a low single-digit rate of unsupported claims plus a clear abstain path for low-confidence queries, measured on a fixed regression set rather than estimated from spot checks.

Does a verification pass make the system too slow or expensive?

A verifier call adds latency and cost, but it can use a smaller, cheaper model than the generator and only needs to run on the final answer. For customer-facing answers the tradeoff usually favors verification; for low-stakes internal tools you may skip it.

How do I know retrieval is the problem and not the model?

Log the retrieved context alongside each hallucinated answer. If the answer the model needed was not in the retrieved passages, that is a retrieval failure. If it was present and the model still strayed, that is a generation or prompting failure. The fix differs, so measure before you choose.
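That triage is easy to automate once each logged failure records which chunk should have answered the question. The log shape below (`gold` and `retrieved` keys) is an assumption for illustration.

```python
from collections import Counter
from typing import List


def classify_failure(gold_chunk_id: str, retrieved_ids: List[str]) -> str:
    """'retrieval' if the answering chunk never reached the model, else 'generation'."""
    return "generation" if gold_chunk_id in retrieved_ids else "retrieval"


# Hypothetical log of hallucinated answers, labelled with the chunk that
# should have answered each question.
logged = [
    {"gold": "c7", "retrieved": ["c1", "c2", "c3"]},
    {"gold": "c4", "retrieved": ["c4", "c9"]},
    {"gold": "c5", "retrieved": ["c8"]},
]

breakdown = Counter(classify_failure(row["gold"], row["retrieved"]) for row in logged)
print(breakdown)
```

In this made-up sample two of the three failures are retrieval failures, which would point at the retriever before the prompt.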

Most teams do not need a bigger model to stop hallucinating. They need a retriever that returns the right grounding, a prompt that forces the model to use it, and a verifier that catches what slips through. If you have an AI feature stuck on accuracy, see how we ship production AI features for US SaaS teams.
