Most LLM queries do not need your most expensive model
Look at your LLM bill and you will usually find the same pattern: one frontier model handling every request, from a one-line classification to a multi-step reasoning task. You are paying the premium rate for work that a model priced at one-fifteenth to one-twentieth of that rate could do just as well. Model routing fixes that by sending each request to the cheapest model that can actually answer it, and it is one of the highest-leverage cost changes a SaaS team can ship in a week.
The short version, for anyone skimming:
- Most production queries, an estimated 60 to 80 percent, are simple enough for a small, cheap model to handle correctly.
- There is roughly a 15 to 20x price gap between a provider's premium tier and its economy tier.
- Routing and cascading can cut LLM spend 45 to 85 percent while keeping around 95 percent of the quality of an all-premium setup.
- The three patterns are pre-request routing (cheapest to run), at-inference cascades (most accurate), and post-response retry (the safety net).
Where the money actually goes
When a product ships its first AI feature, the default is to wire every call to the strongest available model because it is the safest way to make the demo work. That decision quietly sets your unit economics. Every request now costs the frontier rate, whether it was a sentiment tag, a routing decision, or a genuinely hard synthesis task. Because request difficulty is not uniform but your model choice is, you overpay on the large majority of traffic that never needed the big model.
The gap is not small. Across providers in 2026, the premium reasoning tier runs roughly 15 to 20 times the price of the same provider's economy model, and the spread is even wider on output tokens, which are typically priced 4 to 8 times higher than input tokens. If 70 percent of your traffic is simple and you are paying 15 times what you need to on all of it, routing is not an optimization at the margin. That overspend is most of your bill.
The three ways to route
Routing is a spectrum, not a single algorithm. The three approaches trade accuracy against how much extra latency and complexity you take on.
Pre-request routing
A lightweight classifier looks at the incoming request and picks a model before any expensive call happens. The classifier can be a small model, a set of rules, or an embedding-based similarity check against known query types. This is the cheapest pattern to run because you only ever make one model call. The risk is misclassification: if the router sends a hard query to the small model, the user gets a worse answer and you never find out unless you are measuring.
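A rules-based router can be very small. The sketch below is one way to do it, not a recommended rule set: the model names, the keyword list, and the 60-word cutoff are all placeholder assumptions you would replace with values tuned on your own traffic.

```python
import re

# Hypothetical tier names; substitute your provider's model identifiers.
ECONOMY_MODEL = "economy-model"
PREMIUM_MODEL = "premium-model"

# Illustrative hints that a request probably needs multi-step reasoning.
HARD_HINTS = re.compile(
    r"\b(why|explain|compare|analy[sz]e|step[- ]by[- ]step|prove|design|debug)\b",
    re.IGNORECASE,
)
MAX_EASY_WORDS = 60  # assumed cutoff, not a measured value


def pick_model(prompt: str) -> str:
    """Choose a model before any LLM call is made."""
    if len(prompt.split()) > MAX_EASY_WORDS:
        return PREMIUM_MODEL  # long prompts are treated as hard
    if HARD_HINTS.search(prompt):
        return PREMIUM_MODEL  # reasoning keywords are treated as hard
    return ECONOMY_MODEL      # everything else goes to the cheap tier
```

Rules like these are cheap to run and easy to audit, but they only guess at difficulty, which is why you still need to measure answer quality on the requests they send to the small model.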
At-inference cascades
A cascade tries the cheap model first, checks the output against a quality signal, and escalates to a stronger model only when the cheap answer fails the check. This is the most accurate pattern because the decision is based on a real attempt, not a guess about difficulty. The quality check can be the model's own confidence, a validator that confirms the output parses and matches a schema, or a small judge model. You pay the premium only on the fraction of queries the cheap model could not handle.
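As a minimal sketch of that flow, assume each model is wrapped as a plain function from prompt to text and the quality signal is the schema validator mentioned above. The `Model` type, `passes_check`, and `cascade` are illustrative names, not part of any provider SDK.

```python
import json
from typing import Callable

# Hypothetical signature: a real provider client would be wrapped to fit it.
Model = Callable[[str], str]


def passes_check(raw: str, required_keys: set[str]) -> bool:
    """Quality signal: the output must be a JSON object with the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def cascade(prompt: str, cheap: Model, premium: Model,
            required_keys: set[str]) -> tuple[str, bool]:
    """Return (answer, escalated). The premium call runs only on a failed check."""
    answer = cheap(prompt)
    if passes_check(answer, required_keys):
        return answer, False
    return premium(prompt), True
```

Returning the `escalated` flag alongside the answer is a deliberate choice: it gives you the per-request signal you need to track how often the cheap model is falling short.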
Post-response retry and fallback
The safety net. When a call fails outright, times out, or hits a rate limit, you retry against a different deployment or provider. This is less about cost and more about reliability, but it belongs in the same routing layer because it uses the same plumbing. A good routing layer does all three: rules first, cascade when accuracy matters, retry when a provider misbehaves.
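The fallback piece can be sketched as a loop over deployments. This assumes each deployment is a function that raises on failure; the exception types below are Python built-ins standing in for whatever timeout and rate-limit errors your provider client actually raises.

```python
from typing import Callable, Sequence

# Hypothetical signature: a real provider client would be wrapped to fit it.
Model = Callable[[str], str]


class AllProvidersFailed(Exception):
    """Raised when every deployment in the fallback chain errored."""


def call_with_fallback(
    prompt: str,
    deployments: Sequence[Model],
    retryable: tuple[type[Exception], ...] = (TimeoutError, ConnectionError),
) -> str:
    """Try each deployment in order and return the first successful answer."""
    errors = []
    for deployment in deployments:
        try:
            return deployment(prompt)
        except retryable as exc:
            errors.append(exc)  # keep the failure for debugging, then move on
    raise AllProvidersFailed(errors)
```

Errors outside the `retryable` tuple propagate immediately, so a bug in your own code does not get silently retried against every provider.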
A cascade that pays for itself
Here is the math that makes cascades worth the plumbing. Say you handle 1,000,000 requests a month and each averages 500 input and 300 output tokens. If a premium model costs 15x the economy model, and 70 percent of requests are handled correctly by the economy model on the first try, you route those 700,000 requests at the cheap rate and only escalate the remaining 300,000.
The catch is that escalated queries cost more than an all-premium baseline, because you paid for the cheap attempt and then paid for the premium one. But because escalations are the minority, the blended cost still lands far below the all-premium number. In reported production setups, cascades and routing land in the 45 to 85 percent savings range depending on how skewed the traffic is toward easy queries. The more of your traffic is genuinely simple, the bigger the win. This stacks on top of other cost work; teams that combine routing with prompt caching and the batch API for non-urgent jobs see the largest reductions. We walked through one such stack in our writeup on cutting an LLM bill from 48k to 19k a month.
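The arithmetic for the scenario above can be worked through in relative units. This sketch assumes the 15x ratio applies to the whole per-request cost, so the token counts cancel out and one economy request counts as 1 unit. The function name is illustrative.

```python
def monthly_costs(requests: int, handled_by_economy: int, price_ratio: int):
    """Return (all_premium, cascade) cost in economy-request units."""
    escalated = requests - handled_by_economy
    all_premium = requests * price_ratio
    # Every request pays for the cheap attempt; escalations also pay premium.
    cascade = requests * 1 + escalated * price_ratio
    return all_premium, cascade


all_premium, cascade = monthly_costs(1_000_000, 700_000, 15)
savings = 1 - cascade / all_premium
print(all_premium, cascade, f"{savings:.1%}")  # 15000000 5500000 63.3%
```

Each escalated request costs 16 units instead of 15, but the blended bill is still about 63 percent lower under these assumptions.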
The failure modes nobody mentions
Routing is not free. The three ways it goes wrong in production:
The first is a drifting quality check. Your cascade's escalation rate is only as good as the signal that decides when to escalate. A confidence threshold tuned in March can start passing bad answers in June as your traffic mix shifts. Log the escalation rate and sample escalated-versus-not answers weekly.
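One way to log the escalation rate is a rolling window with a simple drift flag. The class below is a hypothetical sketch; the window size, baseline, and tolerance are assumed values you would set from your own history.

```python
from collections import deque


class EscalationMonitor:
    """Tracks the escalation rate over the most recent requests."""

    def __init__(self, window: int = 1000, baseline: float = 0.30,
                 tolerance: float = 0.10):
        self.events = deque(maxlen=window)  # oldest entries drop off automatically
        self.baseline = baseline            # rate observed when the check was tuned
        self.tolerance = tolerance          # allowed absolute deviation

    def record(self, escalated: bool) -> None:
        self.events.append(escalated)

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def drifted(self) -> bool:
        """True when the rate has moved too far from the baseline, up or down."""
        return abs(self.rate() - self.baseline) > self.tolerance
```

A falling rate is as suspicious as a rising one, since a check that has started passing bad answers escalates less. The flag tells you when to look; it does not replace sampling and reading the answers.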
The second is cascade latency. Every escalation means two sequential model calls, so latency on escalated queries is the sum of both calls and can approach double when the two models respond at similar speeds. If those hard queries are on a user-facing path, that shows up as a slower product. Our note on inference latency and time to first token covers how to keep the perceived speed acceptable while a cascade runs.
The third is over-escalation. If the cheap model is a little too weak for your domain, the cascade escalates on most queries and you get the worst of both worlds: two calls per request and almost no savings. The fix is picking the right economy model for your specific traffic, not the cheapest one on the price sheet.
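To see how quickly over-escalation erodes the win, you can sweep the escalation rate. The sketch below assumes a fixed per-request price ratio and ignores the cost of the quality check itself. The function name is illustrative.

```python
def cascade_savings(escalation_rate: float, price_ratio: float) -> float:
    """Fraction saved versus all-premium; negative means the cascade costs more."""
    # Every request pays 1 unit for the cheap attempt, escalations add the premium price.
    cascade_cost = 1 + escalation_rate * price_ratio
    return 1 - cascade_cost / price_ratio


for rate in (0.30, 0.60, 0.90):
    print(f"{rate:.0%} escalated -> {cascade_savings(rate, 15):.0%} saved")
```

At a 15x ratio, a 90 percent escalation rate leaves roughly 3 percent in savings, and the cascade breaks even at an escalation rate of 1 - 1/15, or about 93 percent.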
How to add routing without a rewrite
You do not need to rebuild your app to route. The common pattern is a gateway or proxy that sits between your code and the providers, exposes one API, and holds the routing rules, retries, and per-team budgets in one place. Open-source gateways handle this, so your application keeps calling a single endpoint while the routing logic lives in config.
Start by logging, not routing. For two weeks, record which requests would have been handled by a cheaper model, using a shadow classifier that does not affect the live response. That gives you the real easy-versus-hard split for your traffic instead of a blog estimate. Then build a small eval set of representative queries, turn on the cheap model for the clearly-easy classes, and add a cascade for the ambiguous ones. Measure blended cost and answer quality on your own data before and after. If you cannot see the quality number hold while the cost number drops, do not ship it. The choice between providers underneath the router is its own decision; our Bedrock versus OpenAI cost comparison covers the tradeoffs there.
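The shadow-logging step might look like the sketch below. It assumes the live model and the shadow router are plain functions and that one JSON line per request is enough for your analysis. The field names are made up for illustration.

```python
import json
import time
from typing import Callable, TextIO


def handle_with_shadow(prompt: str,
                       live_model: Callable[[str], str],
                       shadow_router: Callable[[str], str],
                       log: TextIO) -> str:
    """Serve the live answer unchanged and record what the router would have picked."""
    answer = live_model(prompt)  # the user-facing path is untouched
    try:
        shadow_choice = shadow_router(prompt)
    except Exception:
        shadow_choice = "router_error"  # a shadow failure must never reach the user
    log.write(json.dumps({
        "ts": time.time(),
        "prompt_chars": len(prompt),
        "shadow_choice": shadow_choice,
    }) + "\n")
    return answer
```

Counting `shadow_choice` values over the two weeks gives you the easy-versus-hard split directly from your own requests.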
Frequently asked questions
Will routing to a cheaper model hurt answer quality?
Only if you route badly. The point of a cascade is that the cheap model handles the queries it can handle and escalates the rest, so quality stays close to an all-premium setup while cost drops. The number to watch is the quality of escalated-versus-not answers on your own eval set; if easy queries degrade, your router or your economy model is wrong.
What is the difference between routing and a cascade?
Routing picks a model before any call, based on a guess about difficulty, so it runs one call and is cheapest. A cascade tries the cheap model first and escalates on a failed quality check, so it is more accurate but can cost two calls on hard queries. Most production systems use routing for obvious cases and a cascade for the ambiguous middle.
How much can model routing realistically save?
Reported production setups land in the 45 to 85 percent range, driven mostly by how much of your traffic is genuinely simple. If 70 to 80 percent of your queries are easy, savings are at the high end. If your traffic is mostly hard reasoning tasks, routing helps far less.
Do I need a separate tool to route?
No, but a gateway makes it much easier. You can route in application code, but a proxy layer centralizes the rules, retries, fallbacks, and budgets, and lets you change routing without redeploying your app. It also gives you the per-request logging you need to tune the router in the first place.
Would you rather we just build it?
Book a free scoping call and we'll ship your production-safe AI feature this week.