Your LLM bill is mostly output tokens, not input
Most teams trying to cut an LLM bill start by trimming the prompt. That is the wrong end. Across almost every model in 2026, output tokens cost several times more than input tokens, so a chatty model that pads every answer with restated context and caveats is where your money actually goes. Controlling output length is usually the fastest cost lever you have, and unlike caching or routing you can ship it in an afternoon.
The short version:
- Output tokens are priced roughly 4 times input on the median model, and up to 8 times on premium tiers. One frontier model charges 15 dollars per million input tokens and 75 per million output.
- Reasoning and thinking tokens are billed as output, so a verbose chain of thought is expensive even when the user never sees it.
- Length constraints in the prompt, plus a max_tokens cap, can cut output tokens by 50 to 80 percent on structured tasks.
- Combined with caching and pruning, output control is part of stacks that cut total API cost by around 60 percent.
Why output is the expensive half
LLM pricing is asymmetric on purpose. Input tokens are processed in parallel during the prefill step, while output tokens are generated one at a time, each pass through the model producing a single token. That sequential generation is the costly part, and providers price it accordingly. The median output-to-input price ratio across leading models is about 4 to 1, and premium models push it to 8 to 1.
Put concrete numbers on it. If a model charges 15 dollars per million input tokens and 75 per million output, then an 800-token response costs 5 times what 800 input tokens would. Shaving a prompt from 1,200 to 1,000 tokens saves about 0.3 cents per call at those prices, while a response that runs 300 tokens longer than it needed to costs about 2.25 cents, 7.5 times as much. The output side is where the leverage is.
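The comparison above is easy to check. The sketch below uses the article's illustrative prices of 15 and 75 dollars per million tokens; the function and constant names are made up for this example, so substitute your own provider's rates.

```python
# Illustrative prices from the example above, in dollars per million tokens.
# These are the article's example figures, not any specific provider's rates.
INPUT_PRICE_PER_M = 15.0
OUTPUT_PRICE_PER_M = 75.0


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the illustrative prices above."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# Saving from trimming the prompt from 1,200 to 1,000 input tokens.
prompt_saving = call_cost(1200, 0) - call_cost(1000, 0)

# Cost of a response that runs 300 tokens longer than it needed to.
padding_cost = call_cost(0, 300)

print(f"prompt trim saves  ${prompt_saving:.4f} per call")
print(f"extra output costs ${padding_cost:.4f} per call")
print(f"ratio: {padding_cost / prompt_saving:.1f}x")
```

At these prices the 300 padded output tokens cost 7.5 times what the 200-token prompt trim saves.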
The hidden output cost: reasoning tokens
Modern reasoning models generate a long internal chain of thought before the final answer, and those thinking tokens are billed as output even though the user usually never sees them. A model that spends 2,000 tokens reasoning to produce a 200-token answer bills you for 2,200 output tokens, 11 times what the visible answer alone would cost. If you are using a reasoning model for tasks that do not need deep reasoning, you are paying premium output rates for deliberation the task never required. This is one of the most common sources of a surprise bill in 2026, and it ties directly to model choice, which we cover in "Most LLM queries do not need your most expensive model."
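A rough sketch of that billing arithmetic follows. It assumes the 75 dollars per million output price from the earlier example, which is only an illustration, and the function names are hypothetical.

```python
# Assumed output price in dollars per million tokens (illustrative only).
OUTPUT_PRICE_PER_M = 75.0


def billed_output_tokens(visible_tokens: int, reasoning_tokens: int) -> int:
    """Reasoning tokens are billed as output, so they add to the visible answer."""
    return visible_tokens + reasoning_tokens


def output_cost(visible_tokens: int, reasoning_tokens: int = 0) -> float:
    """Dollar cost of the output side of one call."""
    billed = billed_output_tokens(visible_tokens, reasoning_tokens)
    return billed * OUTPUT_PRICE_PER_M / 1_000_000


# The example from the text: 2,000 reasoning tokens for a 200-token answer.
with_reasoning = output_cost(200, 2000)
answer_only = output_cost(200)

print(billed_output_tokens(200, 2000))             # billed output tokens
print(f"${with_reasoning:.3f} vs ${answer_only:.3f} per call")
```

The same 200-token answer costs 11 times more once the hidden reasoning is counted, which is why the choice of model matters for simple tasks.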
The levers that actually cut output
Output control is a handful of concrete techniques, in rough order of impact.
Set a max_tokens cap
Without a hard cap, a model can run on well past the point where it answered the question, especially on open-ended prompts. Set max_tokens to a value slightly above the longest legitimate answer you expect. This does two things: it bounds your worst-case cost per call, and it protects your latency, since generation time scales with output length. Treat it as a safety rail, not the primary control, because a truncated answer is its own failure.
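One way to size the cap is from output lengths you have already logged. The helper below is a sketch, not a recommendation from any provider: the 15 percent headroom is an arbitrary assumption, and the truncation check simply treats a response that used the whole cap as probably cut off.

```python
import math


def suggest_max_tokens(observed_output_lengths: list[int],
                       headroom: float = 1.15) -> int:
    """Return a cap slightly above the longest legitimate answer seen so far.

    `headroom` is an assumed safety margin; tune it for your own traffic.
    """
    if not observed_output_lengths:
        raise ValueError("need at least one observed output length")
    return math.ceil(max(observed_output_lengths) * headroom)


def probably_truncated(output_tokens: int, cap: int) -> bool:
    """A response that used the full cap was most likely cut off mid-answer."""
    return output_tokens >= cap


cap = suggest_max_tokens([120, 340, 410])
print(cap)
print(probably_truncated(472, cap), probably_truncated(300, cap))
```

Tracking how often `probably_truncated` fires tells you whether the cap is acting as a safety rail or quietly cutting real answers.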
Ask for the length you want
The model will match the verbosity you signal. A prompt that says "answer in at most three sentences" or "return only the JSON with no preamble" reliably produces shorter output than one that does not. On structured tasks like classification and extraction, explicit length and format constraints have been shown to reduce output tokens by 50 to 80 percent. The instruction is nearly free and pays off on every single call.
Cut the preamble and the restatement
Left alone, many models open with a restatement of the question and close with a summary of what they just said. Neither adds value in a programmatic pipeline. Telling the model to skip the preamble and the closing summary removes output tokens on every response with no loss of substance.
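The length and preamble instructions from the last two sections can be attached to every prompt in one place. This is a minimal sketch; the function name and the exact wording of the rules are assumptions, and you should test the phrasing against your own model.

```python
def with_output_rules(task_prompt: str, max_sentences: int = 3) -> str:
    """Append explicit length and no-preamble instructions to a task prompt."""
    rules = [
        f"Answer in at most {max_sentences} sentences.",
        "Do not restate the question.",
        "Do not add a preamble or a closing summary.",
    ]
    return task_prompt.rstrip() + "\n\n" + "\n".join(rules)


prompt = with_output_rules("Explain why the nightly export job failed.")
print(prompt)
```

Keeping the rules in one helper means every call site gets the same constraints, and changing the limit is a one-line edit.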
Use structured output instead of prose
If your code parses the answer, ask for JSON or a compact schema rather than a paragraph of prose that the model pads out for a human reader. Structured responses are shorter and more reliable to parse, and they discourage the model from narrating. Pair this with a max_tokens cap sized to the schema.
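For a classification task, the idea might look like the sketch below. The label set, the instruction wording, and the function name are all hypothetical; the parser uses only the standard library and rejects anything that is not the compact JSON object it asked for.

```python
import json

# Hypothetical label set for a support-ticket classifier.
ALLOWED_LABELS = {"billing", "bug", "other"}

# Instruction to append to the prompt; the wording is an assumption.
CLASSIFY_INSTRUCTION = (
    'Return only a JSON object of the form {"label": "<billing|bug|other>"} '
    "with no preamble and no explanation."
)


def parse_label(raw_response: str) -> str:
    """Parse the model's reply and fail loudly if it is not the expected schema."""
    data = json.loads(raw_response)  # raises ValueError on prose or broken JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label


print(parse_label('{"label": "bug"}'))
```

A reply this small needs only a handful of output tokens, so the max_tokens cap for this call can be set very low.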
Where output control fits in the bigger picture
Trimming output is the fastest lever, but it stacks with the others rather than replacing them. Caching removes repeated input cost: prompt caching discounts the static parts of your prompt, and semantic caching skips the model entirely for repeated questions. Routing sends easy queries to cheaper models. Output control makes each of those calls smaller. Teams that combine output length limits with caching and chain-of-thought pruning have reported cutting total API cost by around 60 percent, and we walked through one such stack in "Cutting an LLM bill from 48k to 19k a month."
There is a latency bonus too. Because output tokens are generated sequentially, a shorter response finishes faster. Trimming output improves both the bill and the perceived speed of the product, which is the opposite of most cost cuts that trade one for the other. Our note on inference latency and time to first token covers the speed side in detail.
How to find your own output waste
You cannot cut what you do not measure. Log input and output token counts per request for a week and look at the ratio. If output tokens are a large multiple of what the answer actually needed, or if a reasoning model is spending thousands of thinking tokens on trivial tasks, you have found the money. Set a max_tokens cap, add the length and format instructions to your prompts, and re-measure. On most pipelines the output token count drops immediately and answer quality is unchanged, because the tokens you removed were padding, not content.
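A week of logs can be summarized with a few lines of code. The sketch below assumes each log record is a dict with hypothetical `route`, `input_tokens`, and `output_tokens` fields, and it reuses the article's illustrative prices; adapt both to your own logging and rates.

```python
from collections import defaultdict

# Illustrative prices in dollars per million tokens (assumed, not real rates).
INPUT_PRICE_PER_M = 15.0
OUTPUT_PRICE_PER_M = 75.0


def summarize(records: list[dict]) -> dict:
    """Per-route call count, average output length, and output share of cost."""
    totals = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})
    for record in records:
        bucket = totals[record["route"]]
        bucket["calls"] += 1
        bucket["input"] += record["input_tokens"]
        bucket["output"] += record["output_tokens"]

    summary = {}
    for route, bucket in totals.items():
        input_cost = bucket["input"] * INPUT_PRICE_PER_M / 1_000_000
        output_cost = bucket["output"] * OUTPUT_PRICE_PER_M / 1_000_000
        total_cost = input_cost + output_cost
        summary[route] = {
            "calls": bucket["calls"],
            "avg_output_tokens": bucket["output"] / bucket["calls"],
            "output_cost_share": output_cost / total_cost if total_cost else 0.0,
        }
    return summary


# Hypothetical log records for one route.
logs = [
    {"route": "classify", "input_tokens": 1000, "output_tokens": 200},
    {"route": "classify", "input_tokens": 1000, "output_tokens": 400},
]
report = summarize(logs)
print(report["classify"])
```

Run it before and after adding the cap and the prompt instructions; the routes with the highest average output and output cost share are the ones to trim first.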
Frequently asked questions
Why are output tokens more expensive than input tokens?
Input tokens are processed in parallel during prefill, while output tokens are generated one at a time, each requiring a full pass through the model. That sequential generation is the costly step, so providers price output several times higher, commonly around 4 times input and up to 8 times on premium models.
Do reasoning tokens count toward my bill?
Yes. The internal chain of thought a reasoning model produces before its final answer is billed as output, even though the user usually never sees it. Using a reasoning model for tasks that do not need it means paying premium output rates for deliberation the task never required.
Does setting max_tokens hurt answer quality?
Only if you set it too low and truncate real answers. Sized slightly above your longest legitimate response, max_tokens is a safety rail that bounds worst-case cost and latency without affecting normal answers. Pair it with prompt-level length instructions rather than relying on the cap alone.
How much can controlling output realistically save?
On structured tasks, explicit length and format constraints can cut output tokens by 50 to 80 percent. Because output is the expensive half of the bill, that translates into a large share of total cost, and combined with caching and pruning it is part of stacks that cut API spend by around 60 percent.