
How to ship an AI feature without wrecking your gross margin

Shipping an AI feature is the easy part. Keeping it from quietly eating your gross margin is the part most SaaS teams discover one cloud bill too late. A traditional SaaS product runs at 75 to 85 percent gross margin because serving one more user costs almost nothing. An LLM feature breaks that math: every request has a real, variable cost, and a few power users can turn a profitable seat into a loss.

The good news is that margin is something you engineer, not just something you price. Below is how we think about protecting the unit economics of an AI feature before it reaches a single customer, and how to get back to a healthy margin if it is already live.

What an AI feature does to your gross margin

Classic SaaS cost of goods sold is mostly hosting and support, and it barely moves when usage doubles. Inference is different. Each call to a model has a token cost that scales linearly with how much your users use the feature. The more successful the feature, the bigger the bill.

Walk through a plausible example. Say your support copilot answers a ticket by sending the ticket plus retrieved documentation to a frontier model: about 12,000 input tokens and 600 output tokens per answer. On a current frontier model that is roughly four cents per answer. A customer on a 99 dollar plan who fires 400 answers a month costs you about 16 dollars in inference alone. That is 16 percent of that seat's revenue gone before you account for hosting, the vector database, support, or your own time.

Now look at the heavy user. Someone running 2,000 answers a month costs 80 dollars on a 99 dollar seat. After everything else, that customer is margin negative, and they are usually your most engaged account. Average it across a base where 10 percent of users drive most of the volume and you can watch a feature pull blended gross margin from the low 80s toward the low 60s. That is the compression the whole industry is feeling right now, and it does not show up in a demo.
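The arithmetic behind those numbers is short enough to sketch. The per-million-token prices below are assumptions picked for illustration, not any provider's actual rates, and the function names are ours. At these assumed prices one answer comes out slightly above the four cents used in the text, so the per-seat figures use the rounded four cents.

```python
# All prices are illustrative assumptions, not any provider's actual rates.
INPUT_PRICE_PER_M = 3.00    # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens (assumed)


def cost_per_answer(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one model call at the assumed prices."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def seat_inference(answers_per_month: int, cost_each: float, seat_price: float):
    """Return (monthly inference cost, share of seat revenue it consumes)."""
    monthly_cost = answers_per_month * cost_each
    return monthly_cost, monthly_cost / seat_price


if __name__ == "__main__":
    per_answer = cost_per_answer(12_000, 600)
    print(f"cost per answer: ${per_answer:.3f}")
    # The text rounds to four cents per answer, so the seat math uses 0.04.
    for answers in (400, 2_000):
        cost, share = seat_inference(answers, 0.04, 99.0)
        print(f"{answers} answers: ${cost:.2f} ({share:.0%} of a $99 seat)")
```

Swap in your own token counts and your provider's current prices; the shape of the result matters more than the exact cents.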

Measure cost per request before you price anything

You cannot defend a margin you cannot see. The single highest-leverage thing to build first is cost attribution: for every model call, log the model used, input and output token counts, the computed dollar cost, the customer or workspace ID, and which feature triggered it. Write it to the same place you keep product analytics, not just a cloud billing dashboard.
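A minimal sketch of such an attribution record follows. The price table, model names, and field names are hypothetical, and the JSON string stands in for whatever event format your analytics pipeline accepts.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Hypothetical per-million-token prices in USD; replace with real rates.
PRICES = {
    "frontier": {"input": 3.00, "output": 15.00},
    "small": {"input": 0.30, "output": 1.50},
}


@dataclass
class ModelCallRecord:
    timestamp: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    customer_id: str
    feature: str


def record_call(model: str, input_tokens: int, output_tokens: int,
                customer_id: str, feature: str) -> ModelCallRecord:
    """Compute the dollar cost of one call and wrap it in a loggable record."""
    price = PRICES[model]
    cost = (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000
    return ModelCallRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        model=model,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=round(cost, 6),
        customer_id=customer_id,
        feature=feature,
    )


def to_event(record: ModelCallRecord) -> str:
    """Serialize the record as a JSON event for the analytics pipeline."""
    return json.dumps(asdict(record))
```

Computing the cost at write time, rather than later from token counts, keeps historical rows correct when prices change.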

With that in place you can answer the questions that actually run the business: what does the median answer cost, what does the 95th-percentile answer cost, which ten accounts generate half your inference spend, and what is cost as a percentage of revenue per plan tier. A cloud bill tells you that you spent money. Per-request attribution tells you who spent it and on what, which is the only view that lets you fix it. If you want a back-of-envelope starting point before you instrument anything, our AI cost calculator gets you in the right order of magnitude.
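Once the records exist, those questions reduce to a few aggregations. This sketch assumes each record has been reduced to a (customer_id, cost_usd) pair and uses a simple nearest-rank percentile; the function names are ours.

```python
import math
from collections import defaultdict
from statistics import median


def percentile(values, p: int) -> float:
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def spend_by_account(records) -> dict:
    """Sum cost per customer from (customer_id, cost_usd) pairs."""
    totals = defaultdict(float)
    for customer_id, cost_usd in records:
        totals[customer_id] += cost_usd
    return dict(totals)


def top_accounts_share(records, k: int) -> float:
    """Fraction of total spend generated by the k most expensive accounts."""
    totals = sorted(spend_by_account(records).values(), reverse=True)
    return sum(totals[:k]) / sum(totals)


def cost_pct_of_revenue(spend_usd: float, revenue_usd: float) -> float:
    """Inference spend as a percentage of revenue for an account or tier."""
    return 100 * spend_usd / revenue_usd


if __name__ == "__main__":
    records = [("a", 0.04), ("a", 0.06), ("b", 0.02), ("c", 0.08)]
    costs = [cost for _, cost in records]
    print("median:", median(costs))
    print("p95:", percentile(costs, 95))
    print("top-1 share:", top_accounts_share(records, 1))
```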

This is also where good observability earns its keep. The same trace that tells you whether an answer was correct should carry its cost, so quality and spend live on one timeline. We wrote about why observability and evals answer different questions, and cost is the third axis that belongs on the same trace.

Engineer the cost down before you engineer the price up

Most teams reach for a price increase first. That is backwards. There is usually 40 to 70 percent of inference cost to remove through architecture, and every dollar you remove is a dollar of margin you keep without a single customer conversation.

Route each request to the cheapest model that can answer it

Not every request needs your most expensive model. Classification, short factual lookups, and formatting can run on a small, cheap model; only genuinely hard reasoning needs the frontier tier. A routing layer that picks the model per request, rather than sending everything to the top model by default, is often the largest single saving. We covered the mechanics in routing requests to the cheapest capable model.
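As a rough illustration, a routing layer can start as a rule table. The task labels, the model tier names, and the 4,000-token cutoff below are assumptions for the sketch; a production router would more likely use a classifier or an escalation step when the small model is unsure.

```python
# Task types we assume a small model handles well (hypothetical labels).
SIMPLE_TASKS = {"classification", "lookup", "formatting"}


def choose_model(task_type: str, input_tokens: int,
                 max_small_input: int = 4_000) -> str:
    """Pick the cheapest tier we believe can answer this request.

    Simple tasks with modest context go to the small model; anything
    else defaults to the frontier tier.
    """
    if task_type in SIMPLE_TASKS and input_tokens <= max_small_input:
        return "small"
    return "frontier"


if __name__ == "__main__":
    for task, tokens in [("classification", 500), ("reasoning", 500),
                         ("lookup", 10_000)]:
        print(task, tokens, "->", choose_model(task, tokens))
```

Defaulting to the expensive tier when in doubt trades a little cost for safety on quality, which is usually the right failure mode while you are still measuring.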

Cache the parts that repeat

An AI feature sends the same system prompt and the same reference documents thousands of times a day. Prompt caching on that static prefix cuts the input cost of every call that reuses it, and providers now discount cached input tokens heavily. On top of that, semantic caching returns a stored answer when a new question is close enough to one you have already answered. Together they remove a large slice of redundant spend. The details are in our notes on prompt caching for LLM cost savings and semantic caching.
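The semantic caching idea can be sketched in a few lines. This toy version measures closeness with word overlap (Jaccard similarity) purely so the example runs without dependencies; a real implementation would compare embedding vectors, and the 0.8 threshold is an assumption you would tune against your own traffic.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase word set; a stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Overlap between two word sets, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (token set, stored answer)

    def put(self, question: str, answer: str) -> None:
        self.entries.append((tokenize(question), answer))

    def get(self, question: str):
        """Return the stored answer for the closest question, or None."""
        query = tokenize(question)
        best_score, best_answer = 0.0, None
        for tokens, answer in self.entries:
            score = jaccard(query, tokens)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put("How do I reset my password?", "Use the reset link.")
    print(cache.get("how do I reset my password"))   # cache hit
    print(cache.get("What is your refund policy?"))  # cache miss, call the model
```

A threshold set too low returns wrong answers confidently, so it is worth checking cache hits against your evals before trusting the savings.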

Trim the context you send

Token cost is mostly input cost, and input is mostly the documents you retrieve. Retrieving the top 20 chunks when 5 would answer the question quadruples the retrieved context you pay for, with no quality gain. Tighter retrieval and reranking pay for themselves twice: lower cost and, often, better answers because the model is not buried in noise.
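One way to enforce this is to cap retrieved context by both chunk count and token budget after reranking. The sketch below assumes each chunk arrives as a (rerank score, token count, text) tuple; the default limits are placeholders.

```python
def select_chunks(chunks, max_chunks: int = 5, token_budget: int = 3_000):
    """Keep the highest-scoring chunks that fit the count and token limits.

    chunks: list of (score, token_count, text) tuples.
    Returns (selected texts in score order, total tokens used).
    """
    selected, used = [], 0
    for score, tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if len(selected) >= max_chunks:
            break
        if used + tokens > token_budget:
            continue  # skip a chunk that would overflow; a smaller one may fit
        selected.append(text)
        used += tokens
    return selected, used


if __name__ == "__main__":
    candidates = [(0.9, 500, "a"), (0.2, 500, "b"),
                  (0.8, 2_000, "c"), (0.7, 1_000, "d")]
    print(select_chunks(candidates, max_chunks=2))
```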

Put a token budget and a circuit breaker on every feature

Cost optimization lowers the average. Budgets protect you from the tail. Give each feature, and ideally each customer, a token budget for the billing period. When an account approaches its budget, degrade gracefully rather than letting the bill run: switch them to a cheaper model, shorten retrieved context, queue non-urgent requests, or show a soft limit message. A circuit breaker that trips on anomalous spend is the difference between a surprising invoice and a runaway one when a customer scripts your API or a prompt loop misfires.
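A sketch of both controls, under assumed thresholds: degradation starts at 80 percent of the budget, and the breaker trips at five times the expected spend for a window. The class names, action labels, and chunk limits are illustrative.

```python
from dataclasses import dataclass


@dataclass
class BudgetPolicy:
    monthly_token_budget: int
    soft_limit_ratio: float = 0.8  # start degrading at 80% of budget (assumed)


def plan_request(tokens_used: int, policy: BudgetPolicy) -> dict:
    """Decide how to serve the next request given budget consumed so far."""
    ratio = tokens_used / policy.monthly_token_budget
    if ratio >= 1.0:
        return {"action": "soft_limit_message", "model": None, "max_chunks": 0}
    if ratio >= policy.soft_limit_ratio:
        # Degrade gracefully: cheaper model and shorter retrieved context.
        return {"action": "degrade", "model": "small", "max_chunks": 3}
    return {"action": "serve", "model": "frontier", "max_chunks": 6}


class SpendCircuitBreaker:
    """Trips when spend in the current window exceeds a multiple of normal."""

    def __init__(self, expected_usd_per_window: float,
                 trip_multiplier: float = 5.0):
        self.limit = expected_usd_per_window * trip_multiplier
        self.window_spend = 0.0
        self.tripped = False

    def record(self, cost_usd: float) -> bool:
        """Add a call's cost; return True once the breaker has tripped."""
        self.window_spend += cost_usd
        if self.window_spend > self.limit:
            self.tripped = True
        return self.tripped

    def reset_window(self) -> None:
        """Call at the start of each new time window."""
        self.window_spend = 0.0
        self.tripped = False
```

Check the breaker before each model call and route tripped accounts to the soft limit path until someone has looked at the spike.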

None of this is hostile to users. The 99-percent case never hits a limit. The control exists for the runaway 1 percent that would otherwise turn a healthy feature into a liability, and for the bad actor who finds your endpoint.

Price the feature off your real unit cost

Once you know your true cost per request and you have driven it down, pricing becomes a decision instead of a guess. The trap is flat, unlimited AI inside a per-seat plan: it rewards your heaviest users for costing you the most and gives you no lever when usage spikes. A few patterns that hold margin:

  • Include a generous allowance in the plan, then meter overage, so light users feel unlimited and heavy users cover their own cost.
  • Tie a usage-based or outcome-based component to the action the customer values, such as a resolved ticket or a generated report, so revenue scales with the cost that drives it.
  • Gate the most expensive capabilities behind a higher tier rather than giving every plan your frontier model.
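To see how the first pattern changes the earlier seat math, here is a small sketch. The 500-answer allowance and the eight-cent overage price are hypothetical; the four-cent unit cost and the 99 dollar plan come from the example above. The margin shown counts inference only, not hosting or support.

```python
def monthly_invoice(base_price: float, included_answers: int,
                    overage_price: float, answers_used: int) -> float:
    """Plan price plus metered overage beyond the included allowance."""
    overage = max(0, answers_used - included_answers)
    return base_price + overage * overage_price


def inference_margin(revenue: float, answers_used: int,
                     cost_per_answer: float) -> float:
    """Share of revenue left after inference cost only."""
    return (revenue - answers_used * cost_per_answer) / revenue


if __name__ == "__main__":
    for answers in (400, 2_000):
        revenue = monthly_invoice(99.0, 500, 0.08, answers)
        margin = inference_margin(revenue, answers, 0.04)
        print(f"{answers} answers: invoice ${revenue:.2f}, "
              f"margin after inference {margin:.0%}")
```

Under these assumptions the light user still pays a flat 99 dollars, while the heavy user's invoice grows with usage, so revenue and cost move in the same direction.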

The shift across the market away from pure per-seat pricing toward consumption and outcome models is not a fad. It is the only way to keep variable cost and revenue moving in the same direction. But it only works if you know your unit cost first, which is why attribution comes before pricing.

A worked example: clawing back margin on a support copilot

Take the copilot from earlier, with a blended cost around four cents per answer and margin sliding toward the low 60s. A staged fix usually looks like this:

  • Add per-request cost attribution and discover that two features and eight accounts drive 60 percent of spend.
  • Route classification and simple lookups to a small model, which handles 55 percent of traffic at a tenth of the cost.
  • Turn on prompt caching for the system prompt and the docs prefix.
  • Tighten retrieval from 20 chunks to 6.
  • Cap each account at a monthly token budget with graceful degradation.

The cost per answer falls from about four cents to between one and two, and the feature lands back in the mid-70s on margin without raising a price or shipping a worse answer. The work that made the difference was not a clever prompt. It was treating cost as a first-class engineering requirement, the same way you treat latency or correctness. For the longer view on what these features cost once they are in production, see our breakdown of the real cost of maintaining AI products.
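To show how savings of that size can compose, here is one set of assumptions that lands in the stated range. None of these numbers are measurements: the split of the 12,000 input tokens into a 2,000-token static prefix, a 2,000-token ticket, and 400-token chunks is invented for the sketch, as are the prices and the 10 percent cached-token rate. The small model is treated as a tenth of the optimized frontier call.

```python
# Every number here is an assumption chosen to illustrate the arithmetic.
INPUT_PRICE_PER_M = 3.00       # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00     # USD per million output tokens (assumed)
CACHED_INPUT_DISCOUNT = 0.10   # cached tokens billed at 10% of input price (assumed)
SMALL_MODEL_COST_RATIO = 0.10  # small model at a tenth of the cost
SMALL_MODEL_TRAFFIC = 0.55     # share of traffic routed to the small model

PREFIX_TOKENS = 2_000  # system prompt plus docs prefix (assumed split)
TICKET_TOKENS = 2_000  # the ticket itself (assumed)
CHUNK_TOKENS = 400     # tokens per retrieved chunk (assumed)
OUTPUT_TOKENS = 600


def frontier_cost(prefix_tokens: int, fresh_tokens: int,
                  output_tokens: int, prefix_cached: bool) -> float:
    """Cost of one frontier call, with an optional discount on the prefix."""
    prefix_rate = INPUT_PRICE_PER_M * (CACHED_INPUT_DISCOUNT if prefix_cached else 1.0)
    return (prefix_tokens * prefix_rate
            + fresh_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def blended_cost(frontier: float, small_share: float = SMALL_MODEL_TRAFFIC) -> float:
    """Average cost per answer when part of the traffic goes to a small model."""
    small = frontier * SMALL_MODEL_COST_RATIO
    return (1 - small_share) * frontier + small_share * small


before = frontier_cost(PREFIX_TOKENS, TICKET_TOKENS + 20 * CHUNK_TOKENS,
                       OUTPUT_TOKENS, prefix_cached=False)
after = blended_cost(frontier_cost(PREFIX_TOKENS, TICKET_TOKENS + 6 * CHUNK_TOKENS,
                                   OUTPUT_TOKENS, prefix_cached=True))

if __name__ == "__main__":
    print(f"before: ${before:.4f} per answer")
    print(f"after:  ${after:.4f} per answer")
```

Your own split between prefix, ticket, and retrieved context will differ, which is another reason to read these numbers off your attribution data instead of a model like this one.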

Frequently asked questions

How much does an AI feature really lower SaaS gross margin?

It depends entirely on usage and architecture, but inference can consume a meaningful double-digit percentage of the revenue from an AI feature, enough to pull a product from the high 70s into the 60s if it is unoptimized and priced as flat per-seat. The variance between a tuned and an untuned implementation is large, which is why measurement comes first.

Should I optimize cost or raise prices first?

Optimize first. There is usually significant inference cost to remove through routing, caching, and tighter context, and removing it improves margin without any customer friction. Once you know your real unit cost, you can price deliberately rather than padding for uncertainty you have not measured.

What should I instrument before launching an AI feature?

At minimum, log per-request model, token counts, computed cost, customer ID, and feature, written alongside your product analytics. That single change lets you see cost as a percentage of revenue per account and per tier, which is the view every later decision depends on.

Is usage-based pricing always better for AI features?

Not always, but a pure flat per-seat plan with unlimited AI is the riskiest option because cost and revenue move in opposite directions. A hybrid (a strong included allowance plus metered overage or an outcome-based component) keeps the simplicity customers like while protecting you from the heavy tail.

Protecting the margin on an AI feature is an engineering job that happens to have a finance outcome. Measure cost per request, drive it down in the architecture, cap the tail, and only then set a price that reflects what you actually know. Teams that do this ship AI features that grow revenue and margin together, instead of trading one for the other.
