
Agentic RAG: when retrieval should be a loop, not a pipeline

Agentic RAG turns retrieval from a single fixed step into a control loop: the model decides whether to search, reformulates the query, retrieves again when the evidence is thin, and stops once it has enough to answer. It handles harder, multi-hop questions than a one-shot pipeline, but a complex query can cost 20 to 40 times more than a plain lookup, so the real engineering work is in the stop conditions and in routing most traffic away from the loop entirely. Reach for it where a single retrieval pass demonstrably fails, not everywhere.

Static RAG and agentic RAG, side by side

Classic retrieval-augmented generation is a straight line. A user question goes in, you embed it, pull the top-k chunks from a vector store, paste them into the prompt, and generate an answer. The retriever runs exactly once. It never asks whether the chunks it found are any good, and it never tries a second time. For a clean single-hop question - "what is the refund window on the Pro plan" - that line is fast, cheap, and predictable. A well-built static RAG pipeline answers those in a few hundred milliseconds for a fraction of a cent.

Agentic RAG wraps a reasoning loop around that same retrieval step. Instead of one shot, the model treats search as a tool it can call repeatedly. It plans an approach, retrieves, reads what came back, judges whether the evidence is sufficient, reformulates the query if it is not, and searches again. Retrieval stops being a preprocessing step and becomes a decision the model makes mid-flight. The loop ends when a critique step is satisfied or a hard limit trips.
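The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `agentic_retrieve`, `LoopResult`, and the three callables (`search`, `is_sufficient`, `reformulate`) are hypothetical names, and in a real system each callable would wrap a retriever or a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    evidence: List[str] = field(default_factory=list)
    queries: List[str] = field(default_factory=list)
    stopped_by: str = ""


def agentic_retrieve(
    question: str,
    search: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    reformulate: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> LoopResult:
    """Retrieve, judge, and reformulate until the critique passes or the limit trips."""
    result = LoopResult()
    query = question
    for _ in range(max_rounds):
        result.queries.append(query)
        result.evidence.extend(search(query))
        if is_sufficient(question, result.evidence):
            result.stopped_by = "critique"
            return result
        query = reformulate(question, result.evidence)
    result.stopped_by = "round_limit"
    return result


# Toy stand-ins (assumptions for illustration only).
DOCS = {"rate limiting": "Rate limiting caps the number of API calls per plan."}


def toy_search(query: str) -> List[str]:
    return [text for key, text in DOCS.items() if key in query.lower()]


def toy_sufficient(question: str, evidence: List[str]) -> bool:
    return len(evidence) > 0


def toy_reformulate(question: str, evidence: List[str]) -> str:
    return "rate limiting"


result = agentic_retrieve(
    "what limits how many API calls I get",
    toy_search, toy_sufficient, toy_reformulate,
)
print(result.stopped_by, result.queries)
```

The first query misses, the reformulated query hits, and the loop exits on the critique rather than on the round limit.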

The trade is plain: you answer questions a single pass cannot reach, and pay for it in tokens, latency, and a wider set of failure modes.

When a single retrieval pass stops working

Static RAG breaks on a predictable class of questions, and it breaks quietly - it returns a confident answer built on the wrong chunks rather than an error. Three patterns force the upgrade.

Multi-hop questions

"Why was my invoice higher this month than last, and does my current plan cover the overage?" needs at least three retrievals: this month's usage, last month's usage, and the plan's overage terms. A single embedding of that whole sentence retrieves a muddle of all three and answers none of them well. The agent has to decompose the question and retrieve each part.

Queries where the first search misses

Users do not phrase questions the way your docs are written. Someone asks about "the thing that limits how many API calls I get," and your documentation calls it "rate limiting." A one-shot retriever returns whatever weak matches it found and has no second attempt. A loop notices the weak match, reformulates toward the vocabulary in the corpus, and tries again.

Synthesis across sources

"Summarize every open incident touching the billing service and who owns each" pulls from incident records, service ownership, and status. No single chunk holds the answer. The agent gathers, checks for gaps, and fills them before it writes.

The cost the loop adds, and where it hides

The honest pitch for agentic RAG includes the bill. Each iteration is another retrieval plus another model call to judge the result. In our measurements on customer systems, a vector search round runs 200 to 500ms, and a two-stage rerank step adds another 300 to 800ms on top. Three sequential rounds and you have spent roughly 1.5 to 4 seconds in retrieval alone, before the final answer is generated.
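The retrieval latency follows directly from the per-round figures quoted above. A small sketch of the arithmetic, assuming rounds run sequentially and every round includes the rerank step (the function and constant names are illustrative):

```python
SEARCH_MS = (200, 500)   # vector search, low and high, from the figures above
RERANK_MS = (300, 800)   # two-stage rerank, low and high


def retrieval_latency_ms(rounds: int, rerank: bool = True) -> tuple:
    """Return (low, high) total retrieval latency in ms for sequential rounds."""
    low = SEARCH_MS[0] + (RERANK_MS[0] if rerank else 0)
    high = SEARCH_MS[1] + (RERANK_MS[1] if rerank else 0)
    return rounds * low, rounds * high


low, high = retrieval_latency_ms(3)
print(f"{low / 1000:.1f}s to {high / 1000:.1f}s")  # 1.5s to 3.9s
```

Model calls for critique and reformulation are not included, so end-to-end latency will be higher than this range.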

Tokens scale worse than latency. A reflection loop that reads, critiques, and re-searches typically burns three to ten times the tokens of a single-pass answer, because every chunk it retrieves is read again on the next reasoning step. A simple question that happens to enter the loop costs about the same as static RAG. A genuinely hard one that runs four rounds can cost 20 to 40 times more. The cost is query-dependent and invisible until you read per-request traces. Production traffic has a long tail of vague, multi-part questions that each spin the loop, and that tail is where the monthly bill lives.

Without per-request instrumentation you cannot manage this. If you cannot see tokens and iteration count per query, you are flying blind. It is the same gap we wrote about in observability versus evals: traces tell you a loop ran, evals tell you whether the extra rounds improved the answer.
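Once tokens and iteration count are recorded per request, the long tail is easy to measure. The sketch below assumes a hypothetical `RequestTrace` record and made-up sample numbers; it shows one way to ask how much of the total spend comes from requests that ran many rounds.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class RequestTrace:
    request_id: str
    iterations: int
    total_tokens: int


def tail_share(traces: List[RequestTrace], min_iterations: int = 3) -> float:
    """Fraction of all tokens spent by requests that ran min_iterations or more."""
    total = sum(t.total_tokens for t in traces)
    if total == 0:
        return 0.0
    tail = sum(t.total_tokens for t in traces if t.iterations >= min_iterations)
    return tail / total


# Illustrative data only, not measurements.
traces = [
    RequestTrace("a", 1, 2_000),
    RequestTrace("b", 1, 2_000),
    RequestTrace("c", 1, 2_000),
    RequestTrace("d", 4, 54_000),
]
print("median tokens per request:", median(t.total_tokens for t in traces))
print(f"share of tokens from long loops: {tail_share(traces):.0%}")
```

In this toy sample the median request looks cheap while a single long loop accounts for most of the tokens, which is the pattern an average alone would hide.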

Stop conditions are the actual product

The retrieval strategy is the easy half. The hard half - the part that decides whether this ships or quietly bankrupts a feature - is when the loop stops. Without explicit limits, an agent that cannot find a good answer will keep reformulating and re-searching until it exhausts a context window or a budget. Three controls keep it bounded.

A hard iteration cap

Set a maximum number of retrieval rounds and enforce it in code, not in the prompt. Three is the right default for most knowledge tasks; we rarely see a fourth round change the answer enough to justify its cost. If the agent hits the cap without a confident answer, it should say what it could not find rather than fabricate, which ties directly into how you reduce hallucinations in a production system.
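A minimal sketch of a cap enforced in code, assuming the agent has already decomposed the question into a list of information needs and spends one round per need. `answer_with_cap`, `CappedResult`, and the toy corpus are hypothetical; the point is that the limit lives in the loop and that anything left unresolved is reported rather than guessed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

MAX_ROUNDS = 3  # enforced here, not in the prompt


@dataclass
class CappedResult:
    found: Dict[str, str]
    missing: List[str]
    rounds_used: int


def answer_with_cap(
    needs: List[str],
    lookup: Callable[[str], Optional[str]],
    max_rounds: int = MAX_ROUNDS,
) -> CappedResult:
    """Spend at most max_rounds retrievals, then report what is still missing."""
    found: Dict[str, str] = {}
    rounds_used = 0
    for need in needs:
        if rounds_used >= max_rounds:
            break  # hard stop, regardless of what the model would like to do
        rounds_used += 1
        hit = lookup(need)
        if hit is not None:
            found[need] = hit
    missing = [need for need in needs if need not in found]
    return CappedResult(found, missing, rounds_used)


# Toy corpus (illustrative only).
corpus = {
    "this month's usage": "chunk A",
    "last month's usage": "chunk B",
    "pricing tiers": "chunk C",
    "plan change history": "chunk D",
}
needs = ["this month's usage", "last month's usage", "pricing tiers", "plan change history"]
capped = answer_with_cap(needs, corpus.get)
if capped.missing:
    print("I could not find:", ", ".join(capped.missing))
```

The fourth need is never searched because the cap trips first, so the response names it as unresolved.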

Confidence-based early exit

The cap is a ceiling, not a target. Most questions should finish in one or two rounds. After each round, have the model score whether the retrieved evidence answers the question. If the score is high after the first pass, stop there and generate. Early exit is what keeps the average cost near static RAG even though the worst case is far higher.
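The stop decision and its effect on average cost can be sketched as follows. The 0.8 threshold and the exit distribution are assumptions for illustration, not recommended values; `stop_reason` and `expected_rounds` are hypothetical helper names.

```python
from typing import Dict, Optional


def stop_reason(
    score: float,
    rounds_done: int,
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return why the loop should stop, or None if it should search again."""
    if score >= threshold:
        return "confident"
    if rounds_done >= max_rounds:
        return "cap"
    return None


def expected_rounds(exit_share_by_round: Dict[int, float]) -> float:
    """Average rounds per request given the share of requests exiting at each round."""
    return sum(rounds * share for rounds, share in exit_share_by_round.items())


# Assumed distribution: most requests exit after round 1, a few reach the cap.
average = expected_rounds({1: 0.80, 2: 0.15, 3: 0.05})
print(f"average rounds per request: {average:.2f}")  # 1.25
```

With that assumed distribution the average request runs 1.25 rounds even though the worst case runs three.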

A per-request token budget

Track tokens spent inside a single request and abort the loop when it crosses a threshold. Log the spend per step so you can alert before a class of queries starts eating the budget rather than after the invoice arrives. A runaway query should fail fast and visibly, not silently drain the account.
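One way to implement this is a small budget object that every step charges against. The class names, logger name, and the 10,000-token limit below are assumptions for the sketch; token counts would come from your model provider's usage data.

```python
import logging

logger = logging.getLogger("rag.budget")  # hypothetical logger name


class TokenBudgetExceeded(RuntimeError):
    """Raised when one request spends more tokens than it is allowed."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.spent = 0
        self.steps = []  # (step name, tokens) pairs, kept for per-step alerting

    def charge(self, step: str, tokens: int) -> None:
        """Record the spend for one step and abort once the limit is crossed."""
        self.spent += tokens
        self.steps.append((step, tokens))
        logger.info("step=%s tokens=%d spent=%d limit=%d",
                    step, tokens, self.spent, self.limit)
        if self.spent > self.limit:
            raise TokenBudgetExceeded(
                f"spent {self.spent} of {self.limit} tokens at step '{step}'"
            )


budget = TokenBudget(limit=10_000)
try:
    budget.charge("retrieve_1", 4_000)
    budget.charge("critique_1", 3_000)
    budget.charge("retrieve_2", 4_500)  # crosses the limit
    outcome = "answered"
except TokenBudgetExceeded as exc:
    outcome = f"aborted: {exc}"
print(outcome)
```

Raising an exception makes the failure visible to the caller, which can then return a partial answer or an explicit "could not complete" message.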

Route first, loop second

The single biggest lever on cost is keeping queries out of the loop entirely. Most production traffic is single-hop; if every question enters the agentic path, you pay loop prices for lookups a one-shot retriever answers perfectly. So put a cheap classifier in front and route.

A simple router reads the incoming question and sends it down one of two paths. Single-hop, factual, single-source questions go to static RAG. Multi-hop, synthesis, or "compare and explain" questions go to the loop. The classifier itself should be a small, fast model or even a rules pass - the same logic we use for model routing to cut AI costs, applied to retrieval depth instead of model size. Get the routing wrong toward static and a hard question gets a thin answer; get it wrong toward agentic and you overpay. Tune the boundary with real traffic, not guesses.
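A rules pass can be as small as the sketch below. The cue words and the 15-word threshold are crude assumptions chosen so the examples in this article route as described; they are a starting point to tune against real traffic, not a tested classifier.

```python
import re

# Hypothetical cues that suggest multi-hop or synthesis questions.
CUES = re.compile(r"\b(why|compare|versus|summarize|every|each)\b", re.IGNORECASE)
MAX_STATIC_WORDS = 15  # longer questions tend to bundle several asks


def route(question: str) -> str:
    """Return 'static' for likely single-hop lookups, 'agentic' otherwise."""
    if CUES.search(question) or len(question.split()) > MAX_STATIC_WORDS:
        return "agentic"
    return "static"


for q in (
    "What payment methods do you accept?",
    "Summarize every open incident touching the billing service and who owns each",
):
    print(route(q), "<-", q)
```

Logging the router's decision next to the eventual answer quality gives you the data needed to move the boundary in either direction.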

Chunking and index quality still matter underneath both paths. An agentic loop cannot reason its way around chunks that split a table in half; good chunking strategies reduce how often the loop has to run at all.

A worked example: a B2B billing assistant

Take a support assistant for a SaaS billing product. Two real questions show the split.

"What payment methods do you accept?" is single-hop. The router sends it to static RAG, which retrieves one chunk from the payments doc and answers in under 400ms for a fraction of a cent. Sending that question into a loop would be pure waste.

"My invoice jumped from 340 dollars to 690 this month, is that a bug or did I cross a tier?" is multi-hop. The agent retrieves this month's usage, retrieves the prior month for comparison, retrieves the pricing tiers, notices it has usage and tiers but no record of plan changes, reformulates for "plan change history," confirms the customer crossed into a higher usage band, and answers with the line item that moved. Four retrievals, roughly 30 times the cost of the payment-methods question - and worth it, because the alternative is a wrong answer that becomes a support ticket.

The point is not that agentic RAG is better. One of these questions needs the loop, the other only gets slower and more expensive inside it, and your system has to tell them apart before it spends a cent.

Rolling it out without the bill running away

Ship agentic RAG the way you would ship any expensive control flow: behind a measurement layer, with the static path as the safe default. Start with everything on static RAG and find the questions it answers badly by sampling failures, not guessing. Add the agentic path only for that segment, cap it at three iterations, and gate the rollout on an eval set that scores answer quality, not just whether the loop completed. The cheapest way to get this right is to avoid the common RAG eval mistakes before they reach customers. Watch the per-request cost distribution for a week before you widen the gate.

Done this way, agentic RAG is a targeted upgrade for the slice of traffic that needs multi-step reasoning, not a wholesale replacement for retrieval that already works. The teams that get burned are the ones that turned the loop on everywhere and met the long tail of vague questions at the end of the month.

FAQ

Is agentic RAG just RAG with more steps?

Functionally, yes: it is RAG where retrieval is a tool the model can call repeatedly and choose when to call, rather than a fixed step that runs once. The added value is the model's ability to judge its own evidence and search again. The added risk is unbounded cost, which is why stop conditions matter more than the retrieval technique itself.

How many retrieval iterations should I allow?

Three as a hard cap for most knowledge work, with confidence-based early exit so the average case finishes in one or two. A fourth round rarely changes the answer enough to justify its tokens. Set the cap in code and have the agent report what it could not find when it hits the ceiling.

How much more does agentic RAG cost than static RAG?

It depends entirely on the query. A simple question that enters the loop costs about the same as static RAG. A genuinely multi-hop question running three or four rounds can cost 20 to 40 times more and add a few seconds of latency. Routing most traffic to the static path is what keeps the blended cost reasonable.

When should I not use agentic RAG?

Skip it for high-volume, single-hop, latency-sensitive lookups where a one-shot pipeline already answers correctly. The loop adds cost and failure surface with no benefit there. Reserve it for multi-hop, synthesis, and cross-source questions where a single retrieval pass measurably fails.
