← Back to writing

Self-hosting an LLM is cheaper than the API far later than you think

The pitch for self-hosting an open model is simple: stop renting tokens at a markup and run your own GPU. Open models like Llama, Mistral, and DeepSeek now match or beat the closed frontier models on most benchmarks, so the quality argument that used to end this conversation is gone. That makes the question a pure cost question, and the honest answer is that self-hosting breaks even much later than the GPU price tag suggests.

The short version:

  • Against premium APIs, a single consumer or workstation GPU can break even somewhere around 5 to 30 million tokens a month, depending on the model and hardware.
  • Against budget APIs, you often need 50 to 100 million tokens a month before self-hosting wins, and frequently it never does.
  • Self-hosting costs 3 to 5 times the raw GPU rental price once you add engineering time, monitoring, and idle capacity.
  • For most SaaS teams in 2026, the break-even against frontier models sits near 100 to 256 million tokens a month, a level most products never reach.

Why the GPU price is the smallest number

When people compare self-hosting to an API, they usually put a GPU rental rate next to a per-token price and stop there. That comparison is missing most of the cost. The GPU is the visible line item; the invisible ones are larger.

Engineering time is the real bill

A self-hosted deployment needs someone to run it. Realistic estimates put ongoing maintenance at 10 to 20 hours a month for monitoring, patching, model updates, and incident response. At 75 to 150 dollars an hour for a senior DevOps or ML engineer, that is 750 to 3,000 dollars a month in labor before you serve a single extra token. That number does not appear on any GPU invoice, which is exactly why self-hosting projects blow their budgets.

Infrastructure around the model

Running a model in production is more than a process on a box. A basic setup with a GPU node pool, autoscaling, metrics (Prometheus and Grafana), logging, and load balancing adds 200 to 500 dollars a month in infrastructure, plus meaningfully more engineering time to build and keep it working. You also pay for idle capacity: a GPU you provisioned for peak load sits underused off-peak, and you pay for it around the clock either way.

Stack these up and the rule of thumb is that self-hosting costs 3 to 5 times the raw GPU rental price. A 500-dollar-a-month GPU is really a 1,500 to 2,500-dollar-a-month system once it is a production service with an on-call owner.

Where the break-even actually lands

The break-even point depends entirely on which API you are comparing against, because API prices span a wide range.

Against premium frontier APIs, self-hosting looks attractive earliest. A single strong workstation GPU running a mid-size open model can break even against a premium tier at roughly 5 to 10 million tokens a month, and a cheaper card running a small model breaks even against a mini-tier model closer to 30 million tokens a month. Those are the optimistic scenarios that self-host calculators tend to show.

Against budget APIs, the picture flips. Economy-tier hosted models are priced so low that you need 50 to 100 million tokens a month to justify the operational overhead, and against the very cheapest options self-hosting is rarely worth it at all. When you fold in the 3-to-5x hidden cost multiplier, the break-even against current frontier models (the ones you would actually want to match on quality) sits near 100 to 256 million tokens a month. Most production SaaS features never generate that volume. If you want to run your own numbers, our savings calculator is a starting point, and the broader shape of these costs is covered in our note on AI infrastructure costs.

The reasons to self-host that are not about cost

Cost is usually the stated reason and rarely the real one. The genuinely good reasons to self-host are about control, not the bill:

Data residency and privacy. If your contracts or regulators require that customer data never leaves your infrastructure, an open model on your own hardware answers that in a way a hosted API cannot. This is the most common legitimate driver.

Latency and determinism. Self-hosting removes a network hop and lets you pin a model version so it does not change under you. If you have a hard latency budget, owning the stack can help, though a well-tuned API path is often fast enough. See our writeup on inference latency and time to first token for what actually moves that number.

Fine-tuning and specialization. If you have trained a model on proprietary data and want full control over serving it, self-hosting is the natural home. For a generic model answering generic prompts, it rarely pays.

A cheaper path most teams skip

Before you provision a GPU, exhaust the software levers, because they are faster to ship and reversible. Routing traffic to a cheaper model for easy queries, caching, and trimming output length usually cut the bill more than self-hosting would, without an on-call owner. We cover the routing pattern in most LLM queries do not need your most expensive model, and a full stack of these levers in cutting an LLM bill from 48k to 19k a month. If after that your volume is genuinely in the hundreds of millions of tokens a month and stable, self-hosting starts to make sense. The provider tradeoffs for the API path itself are in our Bedrock versus OpenAI comparison.

How we advise SaaS teams on this

When a team asks us whether to self-host, the first thing we do is get the real monthly token volume and the real reason. If the reason is cost and the volume is under about 50 million tokens a month, self-hosting will almost certainly cost more once labor is counted, and we say so. If the reason is data residency or a specific compliance requirement, the calculus changes and self-hosting can be the right call regardless of the token math. The mistake we see most is a team spending months building GPU infrastructure to save money on a workload that an afternoon of routing and caching would have handled.

Frequently asked questions

Is self-hosting an LLM actually cheaper than using an API?

Sometimes, but far later than the GPU price suggests. Once you add engineering time, monitoring, and idle capacity, self-hosting runs 3 to 5 times the raw GPU cost. Break-even against premium models can start around 5 to 30 million tokens a month, but against budget APIs you often need 50 to 100 million, and frequently the API stays cheaper.

Do open models match closed API models on quality?

In 2026, leading open models match or exceed earlier frontier models on most benchmarks, so the quality gap that used to rule out self-hosting has largely closed for common tasks. That is exactly why the decision is now mostly about total cost and control rather than capability.

What monthly token volume justifies self-hosting?

As a rough guide, self-hosting to save money starts to make sense in the hundreds of millions of tokens a month, near the 100 to 256 million range against frontier-quality models once hidden costs are included. Below roughly 50 million tokens a month, a managed API is usually cheaper.

What are the good non-cost reasons to self-host?

Data residency and privacy requirements, the need to pin a model version for determinism, hard latency budgets, and serving a model you fine-tuned on proprietary data. These can justify self-hosting even when the raw cost math favors an API.

Get shipped

Rather we just build it?

Book a free scoping call and we'll ship your production-safe AI feature this week.