Cut LLM API costs with prompt caching: the real math
Most LLM bills are paying full price to send the same tokens over and over. The system prompt, the tool definitions, the few-shot examples, the retrieved context that barely changes between turns: every request re-bills all of it at the standard input rate. Prompt caching fixes that, and the savings are large enough to change the unit economics of a feature. The catch is that it only helps if your prompts are structured for it, and it can quietly cost more if they are not.
This is a practitioner walkthrough: what caching actually charges, the break-even math, and the three structural mistakes that turn a cache hit into a cache miss.
What prompt caching charges, with current numbers
Two providers, two models, same idea: the provider stores the computed prefix of your prompt so the next call that shares that prefix skips recomputation and bills the reused tokens at a steep discount.
Anthropic charges for it explicitly. A cache write costs 1.25x the normal input rate for a 5-minute time-to-live, or 2x for a 1-hour TTL. A cache read costs 90 percent less than the normal input rate. On Claude Sonnet, that means cache reads run about $0.30 per million tokens instead of $3, and on Haiku about $0.10 instead of $1. You break even after a single cache hit on the 5-minute tier, or two hits on the 1-hour tier.
OpenAI applies caching automatically with no configuration on its recent models. Prompts longer than 1,024 tokens get a 50 percent discount on the cached prefix, and there is no write surcharge. The cache holds for roughly 5 to 10 minutes of inactivity.
The practical difference: Anthropic gives a deeper discount (around 90 percent) but charges to write the cache, so it rewards deliberate, long-lived prefixes. OpenAI gives a shallower discount (50 percent) for free, so it rewards any repeated prefix without you thinking about it.
The break-even math that actually matters
Caching is not free money. The write surcharge means a prefix you cache once and never reuse costs you more than not caching at all. The question is always: how many reads per write.
Take a 4,000-token system prompt plus tool schema that you reuse across a conversation. On Anthropic's 5-minute tier, the first call pays 1.25x to write it; every following call within the window pays 0.1x to read it. If a typical session sends five requests against that prefix, you pay 1.25x once and 0.1x four times, versus 5x at full price. That is roughly 1.65x total against 5x: about a 67 percent reduction on the cached portion. For high-repetition workloads like a code assistant or a RAG chat where the retrieved context is stable across follow-up questions, real-world reductions of 80 to 95 percent on input cost are normal. We walk through one of those before-and-after numbers in how we took one client's LLM bill from $48k to $19k a month.
The rule of thumb: cache anything you will reuse at least twice inside the TTL window, and do not cache anything you touch once. If your traffic is bursty and unpredictable, the 1-hour TTL is worth the higher write cost because it keeps the prefix warm between bursts.
Three mistakes that turn a hit into a miss
Caching keys on an exact prefix match. The provider caches the longest stable beginning of your prompt and stops at the first byte that changes. That single fact explains almost every disappointing cache-hit rate.
Putting variable content near the top
If your prompt opens with a timestamp, a request ID, a user name, or a freshly shuffled set of examples, the cache breaks on the very first token and nothing downstream is reused. Order your prompt static-first: system instructions, tool definitions, and stable context at the top; the user's actual question and any per-request variables at the very bottom.
Letting retrieval reorder itself
RAG pipelines often re-rank or re-sort retrieved chunks per request. Even when the same documents come back, a different order is a different prefix, so the cache misses. Sort retrieved context deterministically (by document ID, not by score) for the portion you intend to cache, and append the score-sensitive material after the cache boundary.
Ignoring the TTL window
A 5-minute TTL is generous for an active chat and useless for a nightly batch job that touches each prefix once an hour. Match the tier to the access pattern. For background or scheduled work, the 1-hour TTL or a different optimization entirely (batch APIs, smaller models) usually wins.
Where caching is not the answer
Caching reduces input cost. It does nothing for output cost, which is often the larger line item for generation-heavy features. If your prompts are short and your completions are long, prompt caching is a rounding error and you should look at model routing, output-length controls, or a cheaper model for the easy requests instead.
It also will not rescue an architecture that re-embeds and re-retrieves on every keystroke, or one that sends the entire conversation history uncompressed on turn forty. Those are design problems that caching masks rather than solves. We see this constantly when we audit AI features; the infrastructure choices underneath matter more than any single discount, which is the theme of our breakdown of where AI infrastructure costs actually come from.
Used well, prompt caching is one of the highest-leverage changes you can make to a shipped LLM feature: a prompt reorder and a TTL choice, no model swap, no quality tradeoff. It is usually the first thing we reach for in the production AI features we build for SaaS teams.
Frequently asked questions
Does prompt caching change the model's output?
No. Caching reuses the computed representation of identical input tokens. The model produces the same output it would have without the cache. It only affects cost and latency, not quality.
How much can prompt caching realistically save?
On the cached portion of the input, expect 50 percent with OpenAI's automatic caching and up to 90 percent per read with Anthropic, netting 80 to 95 percent reductions on input cost for high-repetition workloads once the write cost is amortized. It does not reduce output cost.
Do I need to change my code to use it?
With OpenAI it is automatic above 1,024 tokens. With Anthropic you mark the cache breakpoints explicitly and choose a TTL, then order your prompt static-first so the stable prefix is large. The code change is small; the prompt-structure change is the real work.
When should I not bother with caching?
When prompts are short, when each prefix is used only once inside the TTL, or when output tokens dominate your bill. In those cases, model routing or a smaller model is a better lever. If you want a second opinion on which lever fits your workload, our RAG integration cost breakdown covers the tradeoffs, or talk to the senior LLM engineers who ship these systems.
Rather we just build it?
Book a free scoping call and we'll ship your production-safe AI feature this week.