Self-hosting an LLM vs an API: where the break-even is

The pitch for self-hosting an open model is simple: stop paying per token, buy or rent a GPU, and serve inference at cost. The math sometimes works. More often it does not, because the GPU bill is only about a third of what self-hosting actually costs. Here is the break-even analysis we run before recommending one path or the other for a SaaS team.

Answer first: for most teams in 2026, an API is cheaper than self-hosting until you are pushing sustained, high volume. Below roughly 5 to 10 million tokens a month against a premium API, self-hosting almost never pays. The threshold moves much higher when you compare against cheap small-model APIs.

What a self-hosted endpoint really costs

Start with raw compute. Renting a single A100 80GB in the cloud runs about $2,000 to $3,500 a month at steady use. A purchased A100 amortizes to roughly $300 a month over a 36-month life, but only if you keep it busy. An idle GPU you reserved is pure loss, and most SaaS traffic is bursty, so utilization is rarely the 24/7 you modeled.

Now add the parts the GPU quote leaves out. A realistic deployment needs 10 to 20 engineering hours a month for patching, monitoring, model upgrades, and incident response. At $75 to $150 an hour for a senior infrastructure engineer, that is $750 to $3,000 a month in labor before a single extra token is served. As a rule of thumb, raw GPU spend is only 30 to 40 percent of the true cost; budget a 2.5x to 3x multiplier for the full picture. We break the same idea down in where AI infrastructure costs actually come from and the real cost of maintaining AI products.
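The cost build-up above can be sketched in a few lines. This is a minimal illustration, not a pricing model: the class and method names are hypothetical, the inputs are the low and high ends of the ranges quoted in the text, and anything beyond GPU and labor (power, networking, storage) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class SelfHostCost:
    """Monthly cost inputs for one self-hosted endpoint (illustrative values only)."""
    gpu_monthly_usd: float      # rental or amortized hardware cost
    eng_hours_per_month: float  # patching, monitoring, upgrades, incidents
    eng_hourly_usd: float       # hourly rate of the engineer doing that work

    def labor_usd(self) -> float:
        return self.eng_hours_per_month * self.eng_hourly_usd

    def itemized_total_usd(self) -> float:
        # GPU plus labor only; other overheads are not modeled in this sketch.
        return self.gpu_monthly_usd + self.labor_usd()

    def rule_of_thumb_total_usd(self, multiplier: float = 2.5) -> float:
        # The 2.5x to 3x multiplier from the text, applied to raw GPU spend.
        return self.gpu_monthly_usd * multiplier


low = SelfHostCost(gpu_monthly_usd=2000, eng_hours_per_month=10, eng_hourly_usd=75)
high = SelfHostCost(gpu_monthly_usd=3500, eng_hours_per_month=20, eng_hourly_usd=150)

print(low.labor_usd(), high.labor_usd())  # 750 3000
print(low.itemized_total_usd(), high.itemized_total_usd())
print(low.rule_of_thumb_total_usd(2.5), high.rule_of_thumb_total_usd(3.0))
```

Comparing the itemized total with the rule-of-thumb total is a quick way to see how much of the multiplier your own line items account for.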

What you can actually serve

Throughput decides whether the economics close. With vLLM on a single A100, a 70B model (quantized, since 16-bit weights alone exceed 80 GB) delivers somewhere around 1,000 to 3,000 tokens per second of aggregate output, depending on batch size, context length, and quantization level. On 2x A100 80GB, you can hold roughly 40 to 60 tokens per second per request while serving 20-plus concurrent users. That is enough for a real product, but it means a $3,000-a-month rig has a hard ceiling. Cross it and you are buying a second node, not squeezing the first.
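To turn a tokens-per-second figure into a monthly ceiling, a rough sketch follows. It assumes a 30-day month, identical streaming rates for every concurrent request, and a hypothetical 30 percent average utilization to stand in for bursty traffic; the function names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, an assumption


def aggregate_tps(per_request_tps: float, concurrent_requests: int) -> float:
    """Aggregate throughput if every concurrent request streams at the same rate."""
    return per_request_tps * concurrent_requests


def monthly_token_capacity(tokens_per_second: float, utilization: float) -> float:
    """Tokens a node can emit in a month at a given average utilization (0 to 1)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return tokens_per_second * SECONDS_PER_MONTH * utilization


# 2x A100 figures from the text: 40 to 60 tok/s per request, 20 concurrent users.
low_tps = aggregate_tps(40, 20)
high_tps = aggregate_tps(60, 20)

# Hypothetical 30% average utilization; substitute your own measured value.
print(f"{monthly_token_capacity(low_tps, 0.30):,.0f} tokens/month")
print(f"{monthly_token_capacity(high_tps, 0.30):,.0f} tokens/month")
```

Measured utilization is the input that matters most here; a node that looks ample at 100 percent can be tight once real traffic patterns are applied.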

This is the number teams skip. Per-token API pricing scales linearly and elastically; your self-hosted box scales in lumpy, expensive steps. If your traffic doubles next quarter, the API just bills more, while the GPU plan needs a capacity project.
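The difference between linear and stepwise scaling is easy to see in a small sketch. All numbers below are hypothetical placeholders (a node that handles 600 million tokens a month for $6,000 all-in, and an API at $10 per million tokens); only the shape of the two curves is the point.

```python
import math


def api_cost_usd(tokens_millions: float, price_per_million_usd: float) -> float:
    """API spend grows in proportion to volume."""
    return tokens_millions * price_per_million_usd


def self_host_cost_usd(tokens_millions: float,
                       node_capacity_millions: float,
                       node_monthly_usd: float) -> float:
    """Self-hosted spend grows one whole node at a time."""
    nodes = max(1, math.ceil(tokens_millions / node_capacity_millions))
    return nodes * node_monthly_usd


# Hypothetical inputs, chosen only to illustrate the step at the capacity boundary.
NODE_CAPACITY_M = 600
NODE_MONTHLY_USD = 6000
API_PRICE_PER_M = 10.0

for volume in (300, 600, 601, 1200):
    print(volume,
          api_cost_usd(volume, API_PRICE_PER_M),
          self_host_cost_usd(volume, NODE_CAPACITY_M, NODE_MONTHLY_USD))
```

Note how one extra million tokens past the node's capacity doubles the self-hosted bill, while the API bill moves by a single increment.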

The break-even, in tokens per month

Put the two together and a clear band emerges.

Versus a premium API (frontier models)

Against a frontier-tier API, self-hosting a comparable open model tends to break even somewhere between 5 and 10 million tokens a month. Above that, owned or reserved GPUs start to win on marginal cost, assuming you keep them utilized. One commonly cited point: a Llama-class model on a $2-an-hour GPU breaks even against a $1.25-per-million-token API near 6.8 million tokens a month.
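The break-even arithmetic itself is a single division, sketched here under stated assumptions: a flat blended API price, a GPU billed around the clock for a 730-hour month, and hypothetical function names. With always-on billing, the $2-an-hour and $1.25-per-million inputs produce a result far above the 6.8 million figure quoted above, so that figure presumably rests on billing or pricing assumptions not given here. The sketch is useful mainly for making those inputs explicit.

```python
HOURS_PER_MONTH = 730  # assumption: average month, GPU billed around the clock


def monthly_gpu_cost_usd(hourly_rate_usd: float,
                         billed_hours: float = HOURS_PER_MONTH) -> float:
    return hourly_rate_usd * billed_hours


def break_even_tokens_millions(self_host_monthly_usd: float,
                               api_price_per_million_usd: float) -> float:
    """Monthly volume (millions of tokens) at which API spend equals self-host spend."""
    if api_price_per_million_usd <= 0:
        raise ValueError("API price must be positive")
    return self_host_monthly_usd / api_price_per_million_usd


gpu_only = monthly_gpu_cost_usd(2.00)  # raw GPU rental only
all_in = gpu_only * 2.5                # low end of the 2.5x to 3x rule of thumb

print(break_even_tokens_millions(gpu_only, 1.25))
print(break_even_tokens_millions(all_in, 1.25))
```

Passing a smaller `billed_hours` value shows how strongly the answer depends on whether the GPU is rented on demand or held continuously.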

Versus a budget API (small models)

Against cheap small-model APIs, the picture inverts. Because per-token prices there are already $0.15 to $0.60 per million, the bar is far higher: at least 50 to 100 million tokens a month, and by some estimates sustained volume above roughly 500 million tokens a day (about 15 billion a month), before self-hosting beats simply calling the cheap endpoint. Few SaaS features generate that load.
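Another way to frame the budget-API comparison is to compute the effective self-hosted price per million tokens at a given volume and set it beside the API price. The sketch below assumes a hypothetical all-in cost of $6,000 a month for one node and a 30-day month; the function name is illustrative.

```python
DAYS_PER_MONTH = 30  # assumption


def self_host_price_per_million(monthly_cost_usd: float,
                                tokens_millions_per_month: float) -> float:
    """Effective price per million tokens when a fixed monthly cost is spread over volume."""
    if tokens_millions_per_month <= 0:
        raise ValueError("volume must be positive")
    return monthly_cost_usd / tokens_millions_per_month


ALL_IN_MONTHLY_USD = 6000  # hypothetical all-in cost for one node

monthly_volumes_m = {
    "100M tokens/month": 100,
    "500M tokens/day": 500 * DAYS_PER_MONTH,
}

for label, volume_m in monthly_volumes_m.items():
    price = self_host_price_per_million(ALL_IN_MONTHLY_USD, volume_m)
    print(f"{label}: ${price:.2f} per million tokens")
```

Whether a single node can actually serve the larger volume is a separate capacity question that this sketch does not check.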

The practical read: if your workload is well served by a small model, do not self-host. If you genuinely need a frontier-class open model at sustained volume, model it carefully, because the threshold is reachable but the hidden costs decide it. Our $48k to $19k cost reduction writeup shows how far you can get on APIs alone before hardware enters the conversation.

When self-hosting is the right call anyway

Cost is not the only axis. Self-hosting earns its overhead when you have a hard data-residency or compliance requirement that rules out third-party APIs, when you need a fine-tuned model that no provider hosts, or when you must guarantee capacity that an API rate limit cannot. In those cases you are buying control, not savings, and the break-even math is a secondary concern.

For everyone else, the honest default in 2026 is to stay on an API, right-size the model, and reserve self-hosting for the specific workload that clears the volume bar. If you want a sober build-versus-buy estimate, our team runs this analysis as a scoped engagement. Start with the savings calculator, see what we build, or review the subscription pricing.

Frequently asked questions

At what volume does self-hosting an LLM beat an API?

Against premium frontier APIs, break-even is roughly 5 to 10 million tokens a month. Against cheap small-model APIs it is far higher, often 50 to 100 million tokens a month or more, because the per-token price you are competing with is already very low.

Why is the GPU cost not the whole cost?

Raw GPU spend is typically only 30 to 40 percent of the true cost of self-hosting. Engineering time for maintenance and monitoring adds $750 to $3,000 a month, and idle, under-utilized capacity wastes the reservation. Budget a 2.5x to 3x multiplier over the GPU quote.

How many tokens per second can one GPU serve?

With vLLM, a single A100 serving a quantized 70B model produces roughly 1,000 to 3,000 tokens per second of aggregate output. On 2x A100 80GB you can sustain about 40 to 60 tokens per second per request across 20-plus concurrent users, which sets a hard ceiling per node.

When should I self-host despite the cost?

When you have strict data-residency or compliance rules, need a custom fine-tuned model no provider offers, or must guarantee capacity beyond API rate limits. In those cases you are paying for control rather than a lower bill.