Per-tenant LLM cost caps for multi-tenant SaaS
One tenant can wreck the economics of an AI feature. In a multi-tenant SaaS product, LLM spend is pooled by default: every customer calls the same endpoints, and your provider invoice is one lump sum. Then a single power user loops an agent over a 50,000-row export, or a customer wires your API into their own automation, and your gross margin on that account goes negative for the month. You find out when finance asks why the model bill doubled.
The fix is not a smarter model or a cheaper provider. It is a control plane: attribute every call to a tenant, meter it, and cap it before the spend happens. This article walks through the architecture that keeps per-tenant LLM cost bounded and turns raw usage into something you can bill against.
- Every LLM call must carry a tenant_id, user_id, feature, and environment - including cron jobs and internal tools, not just customer-facing requests.
- Put a gateway in front of the provider so metering and limits live in one place, not scattered across services.
- A per-tenant daily spend cap is the single highest-leverage control: it turns an unbounded risk into a known worst case.
- The same usage ledger that enforces caps is what you later bill from, so build it once and use it for both.
Attribution comes first, and it has to be total
You cannot cap what you cannot measure, and you cannot measure per tenant unless every call is tagged. The rule is total coverage: each request to the model carries at least a tenant_id, a user_id, the feature name, and the environment. Not most calls. The 3 AM lead-scoring cron, the internal admin tool nobody opens on weekends, the retry path after a timeout - all of them. A single untagged code path becomes the blind spot where a runaway job hides, and it is always the one that blows the budget.
This is a stricter requirement than per-feature attribution. Knowing a feature costs $4,000 a month tells you what to optimize; knowing which tenant inside that feature spent $3,200 of it tells you which account to cap or reprice. We covered the feature view in LLM cost attribution per feature; per-tenant attribution is the same discipline pushed down to the customer grain. Tag at the call site, propagate the tenant context through your request pipeline, and treat an untagged model call as a bug, not a rounding error.
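One way to treat an untagged call as a bug is to make it impossible to construct. Here is a minimal sketch in Python. `CallContext`, the `ALLOWED_ENVIRONMENTS` set, and the `system:` user convention for cron jobs are illustrative assumptions, not part of any provider SDK:

```python
from dataclasses import dataclass

ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev"}  # assumed environment names


@dataclass(frozen=True)
class CallContext:
    """Attribution tags that every model call must carry."""
    tenant_id: str
    user_id: str
    feature: str
    environment: str

    def __post_init__(self):
        # Reject empty tags so a missing value fails loudly at the call site.
        for name in ("tenant_id", "user_id", "feature", "environment"):
            if not getattr(self, name):
                raise ValueError(f"untagged model call: {name} is empty")
        if self.environment not in ALLOWED_ENVIRONMENTS:
            raise ValueError(f"unknown environment: {self.environment!r}")


# Cron jobs and internal tools are tagged too (hypothetical "system:" user).
cron_ctx = CallContext("acme", "system:lead-scoring-cron", "lead_scoring", "prod")
```

If the wrapper around your model client only accepts a `CallContext`, the 3 AM cron cannot reach the provider without naming its tenant.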
Put a gateway in the path
Scatter rate-limit and metering logic across a dozen services and it will drift out of sync within a quarter. Centralize it. An LLM gateway sits between your application and the providers, speaks the provider API on both sides, and gives you one place to record spend and enforce limits. It is the same architectural move we described in our AI gateway production architecture writeup, applied to cost governance.
Concretely, the gateway does four things on every request: resolve the tenant from the verified identity, look up that tenant's budget and current usage in a fast store, decide allow or deny, and, once the response comes back, record the actual token usage and cost to a durable ledger. Open-source gateways handle the provider fan-out and can write spend records to Postgres or a callback; you supply the tenant policy. Because the gateway also owns provider routing, it composes cleanly with model routing to cut AI costs - route cheap traffic to a small model and expensive traffic to a large one, all under the same per-tenant budget.
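The four steps fit in a few lines. The sketch below uses plain dictionaries and a list as stand-ins for the fast store and the durable ledger. The `Gateway` class and `fake_provider` stub are hypothetical, meant only to show the order of operations:

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    budgets_usd: dict                               # tenant_id -> daily budget in USD
    spent_usd: dict = field(default_factory=dict)   # stand-in for the fast store
    ledger: list = field(default_factory=list)      # stand-in for the durable ledger

    def handle(self, identity: dict, prompt: str, call_provider):
        # 1. Resolve the tenant from the verified identity.
        tenant_id = identity["tenant_id"]
        # 2. Look up the tenant's budget and current usage.
        budget = self.budgets_usd[tenant_id]
        spent = self.spent_usd.get(tenant_id, 0.0)
        # 3. Decide allow or deny before any spend happens.
        if spent >= budget:
            return {"status": 403, "error": "daily budget exhausted"}
        # 4. Forward the call, then record the actual cost.
        reply, cost_usd = call_provider(prompt)
        self.spent_usd[tenant_id] = spent + cost_usd
        self.ledger.append({"tenant_id": tenant_id, "cost_usd": cost_usd})
        return {"status": 200, "reply": reply}


def fake_provider(prompt):
    return "ok", 0.60   # (reply, cost in USD); a stub, not a real provider call


gw = Gateway(budgets_usd={"acme": 1.00})
results = [gw.handle({"tenant_id": "acme"}, "hi", fake_provider) for _ in range(3)]
```

In this run the first two calls are allowed and the third is denied. Cost is only known after the response, so the check runs against spend recorded so far, and the second call pushes the tenant slightly past the budget before the cap trips.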
Enforce with a fast quota store
Budget checks are on the hot path of every request, so they have to be fast. The standard pattern is a Redis-backed quota store keyed by tenant. Each tenant maps to a budget - tokens per day, requests per minute, a hard daily spend cap - and the gateway checks and increments counters in Redis before forwarding the call.
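To show the shape of the counters, here is an in-memory stand-in for the quota store, keyed by tenant and UTC day. The class name and key format are assumptions. In a real deployment the same logic would typically map to Redis counters using commands such as INCRBY and EXPIRE, with the check and the increment made atomic:

```python
from datetime import datetime, timezone


class QuotaStore:
    """In-memory stand-in for a Redis-backed counter store."""

    def __init__(self):
        self._counters = {}

    @staticmethod
    def day_key(tenant_id: str, now: datetime) -> str:
        # One counter per tenant per UTC day, so the quota rolls over at midnight UTC.
        return f"quota:{tenant_id}:tokens:{now.astimezone(timezone.utc):%Y-%m-%d}"

    def check_and_increment(self, tenant_id, tokens, daily_limit, now=None):
        """Add tokens to today's counter; return True if the call fits the limit."""
        now = now or datetime.now(timezone.utc)
        key = self.day_key(tenant_id, now)
        new_total = self._counters.get(key, 0) + tokens
        if new_total > daily_limit:
            return False          # over budget: the rejected call is not counted
        self._counters[key] = new_total
        return True
```

Putting the date in the key means nothing has to reset the counter: a new day simply starts a new key, and old keys can expire on their own.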
Soft limits versus hard caps
Distinguish two kinds of limit. A soft limit is a rate ceiling: when a tenant exceeds their requests-per-minute limit, return a 429 with a Retry-After header so well-behaved clients back off and retry. A hard cap is a spend ceiling: when a tenant hits their daily budget, return a 403 until the quota resets at midnight UTC. The soft limit smooths bursts; the hard cap bounds the worst case. A tenant on a $20 daily cap can cost you at most $20 that day, plus any requests already in flight when the cap trips, no matter what their code does, and that guarantee is worth more than any single optimization.
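The two limits can share a single decision function. This sketch assumes a fixed one-minute window for the rate limit. The function names and the `X-Quota-Reset-Seconds` header are hypothetical; `Retry-After` is the standard HTTP header:

```python
from datetime import datetime, timedelta, timezone


def seconds_until_utc_midnight(now: datetime) -> int:
    now = now.astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return int((next_midnight - now).total_seconds())


def decide(requests_this_minute, rpm_limit, spent_today_usd, daily_cap_usd, now):
    """Return (status_code, headers) for an incoming request."""
    # Hard cap first: a tenant over budget stays blocked until the daily reset.
    if spent_today_usd >= daily_cap_usd:
        return 403, {"X-Quota-Reset-Seconds": str(seconds_until_utc_midnight(now))}
    # Soft limit: tell the client how long until the current minute window ends.
    if requests_this_minute >= rpm_limit:
        return 429, {"Retry-After": str(60 - now.second)}
    return 200, {}
```

Checking the hard cap first matters: a tenant who is both over budget and bursting should see the 403, because retrying after a few seconds will not help them.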
Where to set the numbers
Derive caps from the plan, not from a guess. If a tier sells for $99 a month and you target a 70 percent gross margin on AI, roughly $30 of monthly model spend is your ceiling for that tenant, which is about $1 a day. Set the hard cap a little above expected usage so normal customers never hit it, and low enough that abuse is capped before it hurts. Alert when a tenant crosses, say, 80 percent of their daily budget so support can reach out before the 403 lands.
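The arithmetic is simple enough to keep next to the plan definitions. The sketch below assumes a 30-day month, and the function name is illustrative:

```python
def daily_cap_usd(plan_price_usd, target_gross_margin, days_in_month=30):
    """Model spend per day that still leaves the target gross margin."""
    monthly_ceiling = plan_price_usd * (1 - target_gross_margin)
    return monthly_ceiling / days_in_month


cap = daily_cap_usd(99, 0.70)   # $29.70 a month, about $0.99 a day
alert_at = 0.80 * cap           # about $0.79: the point where support gets alerted
```

Deriving the cap from plan price and target margin means a pricing change updates every tenant's ceiling without anyone editing limits by hand.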
Meter once, bill from the same ledger
The usage records you write to enforce caps are also the raw material for pricing. Keep them in a store separate from operational data - an append-only usage ledger keyed by tenant, timestamp, feature, model, input tokens, output tokens, and computed cost. That ledger does double duty: it is the source of truth reconciled against the provider invoice at month end, and it is what usage-based or tiered pricing reads from when you move past flat plans.
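A ledger row can be as small as the fields listed above. In this sketch the `PRICES` table holds made-up per-million-token rates for a made-up model name; substitute your provider's actual pricing. `Decimal` is used so that sums reconcile exactly:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal

# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {"small-model": (Decimal("0.50"), Decimal("1.50"))}


@dataclass(frozen=True)
class UsageRecord:
    tenant_id: str
    timestamp: datetime
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: Decimal


def make_record(tenant_id, feature, model, input_tokens, output_tokens, now=None):
    in_price, out_price = PRICES[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / Decimal(1_000_000)
    return UsageRecord(tenant_id, now or datetime.now(timezone.utc), feature,
                       model, input_tokens, output_tokens, cost)


def spend_by_tenant(ledger):
    """Sum cost per tenant: the same query serves caps, billing, and reconciliation."""
    totals = {}
    for record in ledger:
        totals[record.tenant_id] = totals.get(record.tenant_id, Decimal(0)) + record.cost_usd
    return totals
```

Storing token counts alongside the computed cost lets you recompute spend later if the price table turns out to be wrong, without touching the append-only rows.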
Building this once avoids a painful migration later. Teams that bolt on metering only when they decide to charge for AI usage discover their historical data is unreconstructable because early calls were never tagged. If you are isolating tenant data anyway - and you should be - fold cost metering into that boundary; our note on multi-tenant RAG data isolation covers the surrounding tenancy concerns.
FAQ
Won't a hard cap break the product for a paying customer?
Only if the cap is set below legitimate usage, which is why you derive it from the plan and alert at 80 percent. For most tenants the cap is a safety net they never touch. For the rare account that hits it, a 403 with a clear message and an upgrade path is a far better outcome than silently eating a negative-margin month and discovering it weeks later.
Do I need a third-party gateway, or can I build my own?
Either works. A thin middleware in front of the provider SDK that resolves the tenant, checks Redis, forwards the call, and writes a usage row is a few hundred lines. An off-the-shelf gateway saves you the provider fan-out and gives you spend logging out of the box. The architecture is the same regardless; what matters is that metering and enforcement live in one place.
How does this relate to caching and routing?
They stack. Attribution tells you which tenant spends; caps bound the risk; routing and prompt caching lower the cost of the calls that do go through. Attribution comes first, because you cannot aim caching, routing, or downgrades correctly until you can answer which tenant spent the money.
What granularity should caps use - per tenant, per user, or per feature?
Start per tenant, because that is the unit you bill and the unit that carries margin risk. Add per-user limits inside a tenant if a single seat can run up the tenant's bill, and per-feature budgets if one experimental feature needs a separate ceiling. The quota store handles all three the same way; begin with the tenant grain and refine only where a real incident justifies it.
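As a rough illustration of handling all three grains the same way, the scopes can simply be different keys in the same store. The key format and helper names here are assumptions:

```python
def quota_keys(tenant_id, user_id=None, feature=None):
    """Build the counter keys that apply to one request."""
    keys = [f"quota:tenant:{tenant_id}"]
    if user_id:
        keys.append(f"quota:tenant:{tenant_id}:user:{user_id}")
    if feature:
        keys.append(f"quota:tenant:{tenant_id}:feature:{feature}")
    return keys


def within_all_limits(usage, limits, keys):
    # A key with no configured limit is treated as unlimited.
    return all(usage.get(k, 0) < limits[k] for k in keys if k in limits)
```

With only the tenant key configured, this behaves like a plain per-tenant cap; adding a per-user or per-feature entry to `limits` tightens that scope without changing the check.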
Would you rather we just build it?
Book a free scoping call and we'll ship your production-safe AI feature this week.