The AI gateway: one path for every LLM call in production
An AI gateway is one internal service that every LLM call in your product routes through, instead of each feature calling the provider SDK directly. It owns the cross-cutting concerns that otherwise get copy-pasted across the codebase: rate limiting, retries, caching, model routing, cost tracking, and audit logging. If your top production LLM error is a 429 and nobody can say which customer or feature is burning the budget, a gateway is the missing layer.
Datadog's 2026 telemetry put rate-limit errors at roughly 60 percent of all failed LLM calls early in the year. That is not a provider problem you wait out. It is an architecture problem, and it shows up precisely because most teams scatter their LLM plumbing across dozens of call sites.
What an AI gateway actually is
Think of it as a reverse proxy for model calls. Application code stops importing the OpenAI or Anthropic SDK directly. Instead it calls your gateway with a plain request: here is the prompt, here is the tenant it belongs to, here is the task type. The gateway decides which provider and model to use, enforces limits, checks the cache, makes the call with the right retry policy, records what it cost, and returns the response.
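A minimal sketch of that request shape in Python. The `GatewayRequest` fields mirror the three things named above; the routing table and the provider and model names are placeholders invented for illustration, not real identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewayRequest:
    """What application code hands the gateway instead of calling an SDK."""
    prompt: str
    tenant_id: str   # which customer the call belongs to
    task_type: str   # e.g. "summarize" or "chat"; drives routing


# Hypothetical routing table: task type -> (provider, model).
ROUTES = {
    "summarize": ("provider-a", "small-model"),
    "chat": ("provider-b", "large-model"),
}
DEFAULT_ROUTE = ("provider-b", "large-model")


def choose_route(request: GatewayRequest) -> tuple[str, str]:
    """Pick a provider and model from the task type; the caller never sees this."""
    return ROUTES.get(request.task_type, DEFAULT_ROUTE)
```

The caller only ever builds a `GatewayRequest`. Swapping a provider means editing `ROUTES`, not the call sites.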
The value is consolidation. Every concern that has to be consistent across your whole AI surface now lives in exactly one place. When you change your retry backoff or add a new provider, you change one service, not forty call sites that each drifted their own way.
Why scattered LLM calls fall over in production
A feature-by-feature codebase feels fine in a demo and fails under real load. Three failure modes are almost universal.
RPM and TPM are two different ceilings
Provider limits run on two independent axes. Requests per minute (RPM) caps how many calls you make. Tokens per minute (TPM) caps how many tokens you consume. You can sit comfortably under RPM and still get a 429 because a handful of long-context RAG requests blew past TPM. For agents and retrieval-heavy features, TPM is almost always the binding constraint, and code that only counts requests never sees it coming.
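A quick arithmetic check makes the two ceilings concrete. The limits and workload below are illustrative numbers, not any provider's real quotas, and `binding_limit` is a hypothetical helper.

```python
def binding_limit(requests_per_min: int, avg_tokens_per_request: int,
                  rpm_limit: int, tpm_limit: int) -> str:
    """Report which ceiling, if any, a steady workload exceeds the most."""
    tokens_per_min = requests_per_min * avg_tokens_per_request
    rpm_use = requests_per_min / rpm_limit   # fraction of the request ceiling
    tpm_use = tokens_per_min / tpm_limit     # fraction of the token ceiling
    if max(rpm_use, tpm_use) <= 1.0:
        return "ok"
    return "TPM" if tpm_use >= rpm_use else "RPM"


# 60 long-context calls per minute: 12% of RPM, but 120% of TPM.
print(binding_limit(requests_per_min=60, avg_tokens_per_request=20_000,
                    rpm_limit=500, tpm_limit=1_000_000))  # prints TPM
```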
Naive retries turn one 429 into a storm
The instinctive fix for a 429 is to retry. Done naively, every retry is another request that still counts against the same per-minute budget, so a burst of failures becomes a queue of failures that feeds itself. A single tenant sending a spike can push the whole account into a retry loop that starves every other tenant. Aggressive retries without a circuit breaker are the most expensive bug in a production AI app, because you pay for the failed attempts too.
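One common way to keep retries from feeding the storm is to cap the number of attempts and randomize the wait between them. The sketch below uses capped exponential backoff with full jitter. `RateLimited` is a hypothetical exception standing in for whatever your provider client raises on a 429, and the default delays are arbitrary.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception raised by the provider client on a 429."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 20.0) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(send, max_attempts: int = 4, sleep=time.sleep):
    """Call send() up to max_attempts times, backing off after each 429."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up instead of adding to the retry storm
            sleep(backoff_delay(attempt))
```

The jitter spreads retries out so that callers that failed together do not all retry at the same instant. The attempt cap bounds how much extra load a single failure can create.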
One noisy tenant starves everyone
Provider rate limits are account-wide. Your customers are not. Without a layer that enforces per-customer budgets, your largest or buggiest tenant consumes the shared ceiling and every other customer's AI feature degrades at the same moment. You cannot see it in per-feature logs because the limit is global and the cause is one account over there.
What the gateway should own
A useful gateway is thin. It is not a platform. It owns a specific set of responsibilities that are painful to keep consistent anywhere else.
Rate limiting and token budgets, enforced per tenant and per feature before the call leaves your network, so one customer cannot exhaust the shared provider ceiling.
Retries, provider fallback, and circuit breakers, so a transient failure is handled the same way everywhere; this is the home for the resilience patterns that keep an AI feature up when the provider fails.
Model routing, so the gateway can send each request to the cheapest model that can answer it without the caller knowing or caring.
Caching, including exact-match and semantic caching for near-duplicate prompts, checked before any paid call.
Cost attribution, tagging every call with tenant, feature, and model so you can trace spend and defend per-feature gross margin.
Audit logging and redaction, a single choke point where you strip sensitive fields and log what went to the model, which is also where input handling for prompt-injection defense belongs.
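Cost attribution is the easiest of these to sketch. The example below tags each call with tenant, feature, and model, then rolls spend up per tenant. The `CostLedger` class is hypothetical and the prices are made-up placeholders, since real prices vary by provider and model.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens, for illustration only.
PRICE_PER_1K = {"small-model": 0.001, "large-model": 0.01}


class CostLedger:
    """Accumulates spend keyed by (tenant, feature, model)."""

    def __init__(self):
        self._spend = defaultdict(float)

    def record(self, tenant: str, feature: str, model: str, tokens: int) -> float:
        """Record one call and return what it cost."""
        cost = tokens / 1000 * PRICE_PER_1K[model]
        self._spend[(tenant, feature, model)] += cost
        return cost

    def by_tenant(self) -> dict:
        """Total spend per tenant across all features and models."""
        totals = defaultdict(float)
        for (tenant, _feature, _model), cost in self._spend.items():
            totals[tenant] += cost
        return dict(totals)
```

Because every call passes through one `record` call, a per-feature or per-model breakdown is just another grouping over the same keys.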
None of these are new ideas on their own. The point is that they only work if they are enforced in one place. A per-tenant budget that half your call sites forget to check is not a budget.
How the gateway stops the 429 storm
The specific thing a gateway buys you that a pile of SDK calls cannot is control of the token budget before the request is sent.
Start with pre-flight metering. Estimate the token cost of a request from its prompt and expected output, and check it against the remaining per-minute budget before dispatching. If it does not fit, you queue or shed it deliberately instead of firing it into a 429. This alone removes the worst deadlock, where you queue five hundred requests that all fit RPM, none fit TPM, and the queue never drains.
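Pre-flight metering can be sketched as a token bucket that refills continuously up to the per-minute limit. The four-characters-per-token estimate is a rough assumption and a real tokenizer is more accurate. `estimate_tokens` and `TokenBudget` are hypothetical names.

```python
import time


def estimate_tokens(prompt: str, max_output_tokens: int) -> int:
    """Rough estimate: about 4 characters per token, plus the output cap."""
    return len(prompt) // 4 + 1 + max_output_tokens


class TokenBudget:
    """Token bucket that refills continuously up to a per-minute capacity."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.available = float(tokens_per_minute)
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        refill = (now - self.last) * self.capacity / 60.0
        self.available = min(self.capacity, self.available + refill)
        self.last = now

    def try_reserve(self, tokens: int) -> bool:
        """Reserve tokens if they fit; if not, the caller queues or sheds."""
        self._refill()
        if tokens > self.available:
            return False
        self.available -= tokens
        return True
```

A request whose estimate exceeds the whole capacity can never be reserved, so it is worth rejecting it outright rather than queueing it forever.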
Layer per-tenant quotas on top. Give each customer a slice of the account budget so a spike from one is contained to that one. The gateway can enforce a soft limit that borrows idle capacity and a hard limit that never does. When a tenant hits its ceiling, its requests wait in that tenant's own queue rather than degrading everyone.
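One possible way to express soft and hard limits, assuming fixed accounting windows and assuming that "idle capacity" means the part of the account limit not promised to any tenant. `TenantQuotas` is a hypothetical class, and other borrowing rules are equally valid.

```python
class TenantQuotas:
    """Per-window token accounting with a soft and a hard limit per tenant."""

    def __init__(self, account_limit: int, soft: dict, hard: dict):
        self.soft = soft   # guaranteed tokens per window
        self.hard = hard   # absolute ceiling, even when capacity is idle
        self.used = {tenant: 0 for tenant in soft}
        # Capacity not promised to anyone; this is what tenants may borrow.
        self.headroom = account_limit - sum(soft.values())

    def _borrowed(self, tenant: str, used: int) -> int:
        return max(0, used - self.soft[tenant])

    def admit(self, tenant: str, tokens: int) -> bool:
        """Admit the request if it fits the tenant's limits, else refuse."""
        after = self.used[tenant] + tokens
        if after > self.hard[tenant]:
            return False  # the hard limit never borrows
        borrowed_by_others = sum(
            self._borrowed(t, u) for t, u in self.used.items() if t != tenant
        )
        if borrowed_by_others + self._borrowed(tenant, after) > self.headroom:
            return False  # no idle capacity left to lend
        self.used[tenant] = after
        return True

    def reset_window(self) -> None:
        """Call at the start of each new accounting window."""
        self.used = {tenant: 0 for tenant in self.used}
```

A refused request would go to that tenant's own queue. Because borrowing only draws on unpromised headroom, one tenant's spike never eats another tenant's guaranteed slice.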
Then add a real circuit breaker. When the provider keeps returning 429s despite your budgeting, the breaker opens, requests fail fast or route to a fallback provider, and you stop paying for attempts that will not land. After a cooldown the breaker lets a trial request through, and it closes again only once that request succeeds.
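A simplified circuit breaker might look like the sketch below. The threshold and cooldown values are arbitrary assumptions. Once the cooldown elapses it lets traffic through until the first result is recorded, which is looser than a strict single-probe half-open state.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures and allows a retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5,
                 cooldown_seconds: float = 30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        """True if a request may be sent to the provider right now."""
        if self.opened_at is None:
            return True
        # Open: fail fast until the cooldown passes, then allow a trial call.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the breaker

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or restart the cooldown
```

Callers check `allow()` before dispatching, and when it returns `False` they fail fast or route to a fallback provider instead of sending the request.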
A worked example
Picture a B2B SaaS with a support copilot. The numbers here are illustrative, not measured. The copilot answers customer questions over a knowledge base, and a batch job summarizes each closed ticket overnight.
Before the gateway, both features call the provider SDK directly. One evening a large customer imports a backlog and the overnight summarizer fans out thousands of long-context calls at once. They fit RPM but crush TPM. The account starts returning 429s. The copilot, a completely separate feature serving live users, starts failing too, because it shares the same account limit. Both features retry on their own, doubling the load. On-call spends an hour discovering that a background job took down a customer-facing feature, and finance cannot tell which customer caused the spike.
After routing both features through a gateway, the summarizer runs under its own low-priority token budget and gets queued the moment it exceeds it. The copilot has a reserved slice of budget that the batch job cannot touch. Pre-flight metering keeps the account under TPM, so the 429 rate falls toward zero. Cost is tagged per feature and per tenant, so the next morning the team can see exactly what the import cost and, if they want, bill it back. No feature change. One layer.
Build or buy
You do not have to write this from scratch. Mature teams have largely converged on a dedicated gateway, and there are solid options to adopt: LiteLLM, Portkey, and Kong AI Gateway among them. Buying gets you routing, retries, caching, and per-key limits without maintaining them yourself.
Building a thin one is also reasonable when your needs are narrow. A minimal gateway is a small internal service with a single call function, a token bucket per tenant in Redis, a retry-and-fallback wrapper, and a logging hook. Start there and adopt a heavier tool when the routing and caching logic outgrows a few hundred lines. What you should not do is leave the concerns scattered, because that is the state a gateway exists to fix. For the broader set of decisions around getting AI features to production, our AI engineering playbook covers where this layer sits in the stack.
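The fallback half of that wrapper can be small. The sketch below tries providers in order and logs which one served the call. `ProviderError` and the `(name, send)` pairs are hypothetical stand-ins for your real client code.

```python
class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


def call_with_fallback(prompt: str, providers, log=print):
    """Try each (name, send) pair in order and return the first success."""
    errors = []
    for name, send in providers:
        try:
            response = send(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
            log(f"provider {name} failed, trying next")
            continue
        log(f"served by {name}")
        return name, response
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

In a fuller version each `send` would itself be wrapped in the retry policy and guarded by a circuit breaker, so a failing provider is skipped quickly rather than retried at length.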
When you do not need one yet
A gateway is real infrastructure, and adopting it too early is its own mistake. If you have one AI feature, one provider, low traffic, and a single tenant class, the SDK is fine. The signal that it is time is plural: a second feature, a second provider, per-customer usage that varies wildly, a bill nobody can break down, or a 429 rate you cannot explain. When two of those are true, the scattered calls have already started costing more than the gateway would.
Frequently asked questions
Is an AI gateway the same as an API gateway?
No. A traditional API gateway routes HTTP traffic and handles auth and general rate limiting. An AI gateway is specialized for model calls: it understands tokens versus requests, does semantic caching on prompts, routes between models by task, and attributes cost per call. You can run one behind the other, but they solve different problems.
Does a gateway add latency?
A thin gateway adds a small hop, typically single-digit milliseconds inside your own network, which is negligible next to the hundreds of milliseconds a model call takes. In practice it often reduces average latency, because the cache serves repeats without a model call and routing sends simple requests to faster, smaller models.
Where do per-tenant budgets come from?
You divide the account's provider limit across your customers by priority and plan, keeping headroom for bursts. The gateway enforces the split with a token bucket per tenant so no single customer can consume the shared ceiling. This is the same isolation discipline behind multi-tenant data separation in SaaS, applied to rate rather than data.
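As a sketch, the split can be a weighted division with a slice held back for bursts. The plan weights, the 20 percent headroom, and the `split_budget` helper are all illustrative assumptions.

```python
def split_budget(account_tpm: int, weights: dict,
                 headroom_percent: int = 20) -> dict:
    """Split an account-wide TPM limit across tenants in proportion to their
    weight, holding back headroom_percent of the limit for bursts."""
    usable = account_tpm * (100 - headroom_percent) // 100
    total_weight = sum(weights.values())
    return {tenant: usable * weight // total_weight
            for tenant, weight in weights.items()}


# Hypothetical tenants weighted by plan.
shares = split_budget(1_000_000, {"enterprise": 5, "pro": 3, "free": 2})
print(shares)  # {'enterprise': 400000, 'pro': 240000, 'free': 160000}
```

Each share then becomes the capacity of that tenant's token bucket.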
Do I still need retries if I have a gateway?
Yes, but they live in the gateway now, with jitter and a circuit breaker, and they are implemented once for every caller instead of being reimplemented per feature. Centralizing them is the point, because a retry policy that differs across call sites is how a 429 becomes a storm.
Would you rather we just build it?
Book a free scoping call and we'll ship your production-safe AI feature this week.