LLM Model Routing: Cut AI Costs Without Losing Quality

LLM model routing sends each incoming request to the cheapest model that can answer it correctly, instead of pushing everything through one expensive frontier model. A small classifier or set of rules inspects the request, scores how hard it is, and picks a model tier to match. SaaS teams that route well typically cut AI spend 40-70% with no measurable drop in answer quality, because most production traffic is simple enough for a cheaper model to handle.

The reason this works is boring and consistent: real traffic is lopsided. A support assistant might see thousands of "where is my invoice" questions for every genuinely hard multi-step reasoning request. If you send both to the same frontier model, you pay frontier prices for the easy majority of requests, which a model a tenth of the cost would have answered just as well.

What model routing actually is

Routing is a thin decision layer that sits in front of your model calls. Before the request reaches a model, the router answers one question: which tier should handle this? The tiers usually look like a small fast model for high-frequency simple work, a mid-tier model for standard tasks, and a frontier model reserved for hard reasoning, long-context synthesis, or anything user-facing where a wrong answer is expensive.

This is different from prompt caching, which reuses computation on repeated prefixes, and different from batching, which trades latency for throughput. Routing decides which model runs at all. The three compose well, and most teams that take cost seriously end up using all three. We cover the caching side in our guide to prompt caching for LLM cost savings, and the latency side in inference latency and time to first token.

Why one model for everything is the expensive default

Most AI features start as a single hardcoded model call. That is the right call on day one, when you are proving the feature works. The problem is that the model you picked to make the hardest case work then handles every case, including the trivial ones. Pricing across tiers in 2026 commonly spans 10x to 30x between a small model and a frontier model on input tokens, so the gap you are overpaying on is not small.

Consider a B2B SaaS product with an in-app assistant doing roughly 500,000 model calls a month. Suppose 70% of those are short lookups and classifications, 25% are medium summaries and drafting, and 5% are genuinely hard. Running all of it on a frontier model means every call is billed at frontier rates, however trivial the request. Move the 70% to a small model and the 25% to a mid-tier model, and the blended cost per call drops sharply because the expensive tier now touches only the 5% that needs it. The exact figure depends on how many tokens each class of request consumes, but a 40-70% reduction in monthly spend is a realistic outcome, and published routing research has shown cost cuts above 80% while holding most of the frontier model's quality on benchmark tasks.
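To make the dependence on token sizes concrete, here is a minimal sketch of the blended-cost arithmetic for the 70/25/5 mix above. The relative prices and the average tokens per request are illustrative assumptions, not figures from any provider's price list.

```python
# Illustrative assumptions: price per token relative to the frontier model
# (frontier = 1.0) and average tokens per call for each traffic class.
RELATIVE_PRICE = {"small": 0.1, "mid": 0.3, "frontier": 1.0}

TRAFFIC = [
    # (share of calls, avg tokens per call, tier after routing)
    (0.70, 300, "small"),      # short lookups and classifications
    (0.25, 1500, "mid"),       # medium summaries and drafting
    (0.05, 8000, "frontier"),  # genuinely hard requests
]


def cost_per_call(traffic, routed):
    """Average cost per call in relative units (tokens x relative price)."""
    total = 0.0
    for share, tokens, tier in traffic:
        # Without routing, every class is billed at the frontier price.
        price = RELATIVE_PRICE[tier] if routed else RELATIVE_PRICE["frontier"]
        total += share * tokens * price
    return total


baseline = cost_per_call(TRAFFIC, routed=False)
routed = cost_per_call(TRAFFIC, routed=True)
savings = 1 - routed / baseline
print(f"baseline={baseline:.1f} routed={routed:.1f} savings={savings:.0%}")
```

Under these assumed numbers the saving is about 46%. The hard 5% of calls carries roughly 40% of the baseline cost because those requests are long, which is why the headline figure moves so much with token sizes.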

The three signals that decide the model

Good routing is not a giant model picking another model. It is a cheap, fast decision based on a few signals you can compute in milliseconds.

Task type

The strongest signal is what the request is asking for. Classification, extraction, short factual lookups, and format conversion are tasks where small models are close to frontier quality. Open-ended reasoning, multi-step planning, and synthesis across long context are where the frontier model earns its price. A lightweight intent classifier, or even explicit routing by feature (the autocomplete endpoint always uses the small model, the analysis endpoint always uses the frontier one), captures most of the savings.
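Explicit routing by feature can be as small as a lookup table. The sketch below assumes hypothetical feature names and tier labels; the only design decision it encodes is that an unknown feature falls back to the strongest tier rather than the cheapest.

```python
# Hypothetical feature names and tier labels, for illustration only.
FEATURE_TIER = {
    "autocomplete": "small",
    "tagging": "small",
    "summarize": "mid",
    "analysis": "frontier",
}

# Unknown features fall back to the strongest tier until someone maps them.
DEFAULT_TIER = "frontier"


def tier_for_feature(feature: str) -> str:
    """Pick a model tier from the calling feature alone."""
    return FEATURE_TIER.get(feature, DEFAULT_TIER)


print(tier_for_feature("autocomplete"))  # small
print(tier_for_feature("new-feature"))   # frontier
```

Defaulting upward costs more for unmapped features, but it means a new endpoint never ships on a tier nobody evaluated.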

Input complexity

Length and structure are cheap proxies for difficulty. A 200-token question is rarely as hard as a 12,000-token document analysis. Token count, the number of distinct sub-questions, and whether tool calls are involved all push a request up or down a tier. You do not need a perfect difficulty score; you need a threshold that catches the obvious easy cases.
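A threshold-based complexity check might look like the sketch below. The thresholds, the four-characters-per-token estimate, and the use of question marks as a stand-in for sub-questions are all rough assumptions to tune against your own logs.

```python
# Hypothetical thresholds; tune them against your own traffic.
SMALL_MAX_TOKENS = 500
MID_MAX_TOKENS = 4000
TIERS = ["small", "mid", "frontier"]


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def tier_for_input(text: str, uses_tools: bool = False) -> str:
    """Pick a tier from input length, then bump it for structural signals."""
    tokens = estimate_tokens(text)
    if tokens <= SMALL_MAX_TOKENS:
        level = 0
    elif tokens <= MID_MAX_TOKENS:
        level = 1
    else:
        level = 2
    # Three or more question marks suggest several sub-questions: go up a tier.
    if text.count("?") >= 3:
        level += 1
    # Tool calls usually mean multi-step work: go up a tier.
    if uses_tools:
        level += 1
    return TIERS[min(level, len(TIERS) - 1)]


print(tier_for_input("Where is my invoice?"))                   # small
print(tier_for_input("Where is my invoice?", uses_tools=True))  # mid
print(tier_for_input("x" * 48_000))                             # frontier
```

A real tokenizer would give exact counts, but for catching the obvious easy cases a character-based estimate is usually close enough.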

Stakes and confidence

Not every wrong answer costs the same. A miscategorized internal tag is cheap to fix; a wrong number in a customer-facing financial summary is not. Route high-stakes requests up a tier regardless of how easy they look. A useful pattern is a cascade: try the cheap model first, check its confidence or run a fast validator, and escalate to a stronger model only when the cheap answer fails the check. The cascade pays for itself whenever the cheap model clears the bar, which on real traffic is most of the time.
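The cascade pattern can be sketched in a few lines. Here the models and the validator are plain callables standing in for real API clients and real checks; every name below is hypothetical.

```python
from typing import Callable

# A model is any callable from prompt to answer. Real code would wrap an API client.
Model = Callable[[str], str]
Validator = Callable[[str, str], bool]


def cascade(prompt: str, cheap: Model, strong: Model,
            is_acceptable: Validator, high_stakes: bool = False) -> tuple[str, str]:
    """Return (answer, path) where path is 'cheap', 'escalated', or 'strong'."""
    if high_stakes:
        # High-stakes requests skip the cheap tier entirely.
        return strong(prompt), "strong"
    draft = cheap(prompt)
    if is_acceptable(prompt, draft):
        return draft, "cheap"
    # The cheap answer failed the check, so pay for the stronger model.
    return strong(prompt), "escalated"


# Stand-in models and validator, for demonstration only.
def cheap_model(prompt: str) -> str:
    return "" if "explain" in prompt else "Your invoice is under Billing."


def strong_model(prompt: str) -> str:
    return "Detailed answer."


def non_empty(prompt: str, answer: str) -> bool:
    return bool(answer.strip())


print(cascade("where is my invoice", cheap_model, strong_model, non_empty))
print(cascade("explain the Q3 churn variance", cheap_model, strong_model, non_empty))
```

The first call returns the cheap answer, and the second escalates because the stand-in cheap model returns nothing. In practice the validator is where the work is: a schema check, a confidence threshold, or a small grader model.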

A routing setup that survives production

Routing is easy to demo and easy to get wrong once real traffic hits it. A few decisions separate a router that saves money from one that quietly degrades quality.

Start with rules, add a classifier later

The first version of a router should be explicit if-then rules tied to feature and input length. Rules are debuggable, predictable, and good enough to capture the bulk of the savings. Reach for a learned classifier only once you have logs showing where rules misroute. A classifier you cannot explain is a liability when a customer asks why an answer regressed.
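One way to keep rules debuggable is an ordered table where the first match wins and the rule name doubles as the logged reason. The rule names, feature names, and thresholds below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    feature: str
    token_count: int
    high_stakes: bool = False


# Ordered rules: the first predicate that matches decides the tier.
RULES = [
    ("high_stakes", lambda r: r.high_stakes, "frontier"),
    ("autocomplete_feature", lambda r: r.feature == "autocomplete", "small"),
    ("short_input", lambda r: r.token_count <= 500, "small"),
    ("medium_input", lambda r: r.token_count <= 4000, "mid"),
]


def route(request: Request) -> tuple[str, str]:
    """Return (tier, reason), where reason names the rule that fired."""
    for name, predicate, tier in RULES:
        if predicate(request):
            return tier, name
    # Nothing matched: fall back to the strongest tier.
    return "frontier", "default"


print(route(Request("chat", 120)))                    # ('small', 'short_input')
print(route(Request("chat", 120, high_stakes=True)))  # ('frontier', 'high_stakes')
print(route(Request("chat", 12_000)))                 # ('frontier', 'default')
```

Because every decision carries the name of the rule behind it, a misroute in the logs points directly at the line to change.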

Make every route observable

Log the chosen tier, the reason it was chosen, the latency, the token counts, and the cost for every request. Without this you cannot tell a healthy router from one sending hard requests to the small model. This is the same discipline that separates teams who trust their agents from teams who get surprised by them, which we unpack in agent observability versus evals.
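A structured log line per request is enough to start. The sketch below assumes a hypothetical record shape and writes one JSON object per routed call using the standard library logger.

```python
import json
import logging
from dataclasses import asdict, dataclass

logger = logging.getLogger("router")


@dataclass
class RouteRecord:
    # Field names are illustrative; keep whatever your pipeline already uses.
    request_id: str
    tier: str
    reason: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


def log_route(record: RouteRecord) -> str:
    """Emit one JSON line per routed request and return it."""
    line = json.dumps(asdict(record), sort_keys=True)
    logger.info(line)
    return line


example = RouteRecord("req-1", "small", "short_input", 212.5, 180, 40, 0.0002)
print(log_route(example))
```

One JSON object per line keeps the log easy to load into whatever you already use for analysis, and the reason field is what lets you group misroutes by rule.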

Gate routing changes behind evals

Every routing rule is a quality bet: this class of request is fine on a cheaper model. Prove it. Run a held-out eval set through both the cheap and expensive tier and compare. If the cheap tier matches on that slice, route it down with confidence. If it does not, keep it on the expensive tier. Routing without an eval harness is guessing, and guessing about cost is how you ship a regression to save a few dollars.
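A per-slice eval gate can be sketched as a comparison between two tiers on the same held-out examples. Exact-match scoring and the one-point tolerance are assumptions; most real tasks need a task-specific grader.

```python
def accuracy(model, examples):
    """Fraction of (prompt, expected) pairs the model answers exactly."""
    correct = sum(1 for prompt, expected in examples if model(prompt) == expected)
    return correct / len(examples)


def can_route_down(cheap, expensive, examples, max_drop=0.01):
    """True if the cheap tier is within max_drop of the expensive tier."""
    if not examples:
        raise ValueError("eval set is empty")
    return accuracy(cheap, examples) >= accuracy(expensive, examples) - max_drop


# Stand-in models backed by lookup tables, for demonstration only.
CHEAP_ANSWERS = {"2+2": "4", "capital of France": "Paris"}
STRONG_ANSWERS = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}


def cheap_model(prompt):
    return CHEAP_ANSWERS.get(prompt, "unknown")


def strong_model(prompt):
    return STRONG_ANSWERS.get(prompt, "unknown")


eval_set = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
print(can_route_down(cheap_model, strong_model, eval_set))      # False
print(can_route_down(cheap_model, strong_model, eval_set[:2]))  # True
```

The two calls show why the gate runs per slice: the cheap stand-in fails on the full set but matches on the first two examples, so only that class of request would be routed down.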

What to measure before and after

Three numbers tell you whether routing is working. The first is blended cost per request, which should fall once routing is live. The second is the quality score on your eval set per tier, which should hold steady. The third is the escalation rate in a cascade setup, which tells you whether your cheap tier is actually handling the load or punting everything upward. If escalation is high, either too much hard traffic is entering the cascade at the cheap tier or the validator is rejecting acceptable answers, and in both cases you are paying for two model calls instead of one.
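All three numbers fall out of the per-request log. The sketch below assumes each record is a dict with hypothetical keys for tier, cost, an escalation flag, and a quality score between 0 and 1.

```python
from collections import defaultdict


def routing_metrics(records):
    """Compute blended cost, escalation rate, and mean quality per tier."""
    if not records:
        raise ValueError("no records to summarize")
    total_cost = sum(r["cost"] for r in records)
    escalations = sum(1 for r in records if r["escalated"])
    scores = defaultdict(list)
    for r in records:
        scores[r["tier"]].append(r["score"])
    return {
        "blended_cost": total_cost / len(records),
        "escalation_rate": escalations / len(records),
        "quality_by_tier": {t: sum(s) / len(s) for t, s in scores.items()},
    }


# Made-up records for demonstration. The escalated call pays for both models.
records = [
    {"tier": "small", "cost": 0.001, "escalated": False, "score": 1.0},
    {"tier": "small", "cost": 0.001, "escalated": False, "score": 0.0},
    {"tier": "mid", "cost": 0.004, "escalated": False, "score": 1.0},
    {"tier": "frontier", "cost": 0.021, "escalated": True, "score": 1.0},
]
print(routing_metrics(records))
```

Run the same function over a week of logs before routing goes live and a week after, and compare the three values side by side.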

Watch tail latency too. A cascade that retries on a stronger model adds latency to the requests that escalate. For user-facing features, route latency-sensitive paths straight to a fast model rather than running a cascade that makes the slow case slower. Cost and latency are the same conversation, which is why we treat AI spend and infrastructure together in our notes on AI infrastructure costs.
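The effect on the tail is easy to see with a toy distribution. The latencies below are invented for illustration: a cheap call plus validation at 320 ms, a direct call to the strong model at 1,500 ms, and a 10% escalation rate.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented latencies in milliseconds, for illustration only.
CHEAP_PLUS_CHECK_MS = 320
STRONG_MS = 1500

# 90 requests clear the check; 10 escalate and pay for both calls in sequence.
cascade_latencies = [CHEAP_PLUS_CHECK_MS] * 90 + [CHEAP_PLUS_CHECK_MS + STRONG_MS] * 10
direct_latencies = [STRONG_MS] * 100

print("cascade p50:", percentile(cascade_latencies, 50))  # 320
print("cascade p95:", percentile(cascade_latencies, 95))  # 1820
print("direct  p95:", percentile(direct_latencies, 95))   # 1500
```

The median improves a lot, but the 95th percentile is worse than calling the strong model directly, which is the trade to weigh on latency-sensitive paths.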

When routing is not worth it

Routing has real engineering cost: a decision layer, observability, evals per tier, and ongoing tuning as models change. If your AI feature does a few thousand calls a month, the savings will not pay for the complexity. Skip routing, pick one model that handles your hardest case, and revisit when volume grows. Routing earns its place when model spend is a line item someone asks about in a finance review, not before. For a worked example of cutting a real bill, see how one team went from a heavy monthly figure to a fraction of it in LLM cost optimization, 48k to 19k.

Frequently asked questions

Does model routing hurt answer quality?

It should not, if you gate each routing rule behind an eval. The goal is to move only the requests a cheaper model handles as well as the expensive one. Quality drops when teams route by guesswork instead of measuring per-tier accuracy on a held-out set.

How much can model routing save?

For most production SaaS workloads, 40-70% off model spend is a realistic range, because the majority of traffic is simple. Aggressive routing on lopsided traffic, validated against quality benchmarks, has reached 80% or more in published research, but treat the higher figures as a ceiling, not a promise.

Should I build routing or buy an LLM gateway?

Start with simple rules in your own code so you understand your traffic. A gateway or router product helps once you are running many models and want centralized logging, fallback, and spend controls. Either way, the routing logic is only as good as the evals behind it.

How is routing different from prompt caching?

Caching reuses computation on repeated prompt prefixes within a single model; routing decides which model runs in the first place. They solve different parts of the bill and stack cleanly, so most cost-serious teams use both alongside batching.
