Semantic caching: cut LLM cost on repeat questions
Most production LLM features answer the same handful of questions over and over, phrased a hundred slightly different ways. Without a cache, every rephrasing is a fresh, full-price model call. Semantic caching reuses a stored answer when a new query is close enough in meaning to one you have already served. Done well, it cuts inference cost by a reported 30 to 70 percent, and a hit skips generation entirely, leaving only the time to embed and look up the query.
This is not the same thing as the prompt caching your provider offers, and conflating the two is why teams either skip semantic caching or implement it unsafely. This post draws the line and gives the production guardrails.
Semantic caching is not prompt caching
Provider-side prompt caching reuses the computed prefix of a prompt, the long static system message and context you send on every call, so you are billed less for those input tokens. It is keyed on an exact prefix match and it still calls the model. We cover it separately in how prompt caching cuts LLM cost.
Semantic caching is different. It stores the full response keyed by the meaning of the user query, and on a hit it skips the model call entirely. Where prompt caching trims the input bill, semantic caching can eliminate the request. The two stack: cache the prefix on misses, skip the call on hits.
How it works in production
The mechanism is short. Embed the incoming query. Look it up in a vector store against previously seen queries. If the nearest neighbor is within a similarity threshold, return its cached answer. Otherwise call the model and write the new query, embedding, and response back to the cache.
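The four steps above can be sketched in a few lines. This is a minimal illustration, not a production design: `SemanticCache`, `embed`, and `call_model` are hypothetical names, the embedding function and model client are whatever you already use, and the linear scan stands in for a real vector index.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticCache:
    embed: Callable[[str], Vector]    # assumption: your embedding model
    call_model: Callable[[str], str]  # assumption: your LLM client
    threshold: float = 0.8            # starting point only; tune on real traffic
    entries: List[Tuple[str, Vector, str]] = field(default_factory=list)

    def lookup(self, vec: Vector) -> Optional[str]:
        """Return the nearest neighbor's response if it clears the threshold."""
        best_score, best_response = -1.0, None
        # Linear scan for clarity; a vector store would do this with an index.
        for _query, cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def answer(self, query: str) -> Tuple[str, bool]:
        """Return (response, was_cache_hit)."""
        vec = self.embed(query)
        cached = self.lookup(vec)
        if cached is not None:
            return cached, True  # hit: no model call
        response = self.call_model(query)  # miss: pay for the call once
        self.entries.append((query, vec, response))
        return response, False
```

Returning the hit flag alongside the response makes it easy to log hit rate from day one, which you will need when tuning the threshold.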
The vector store
An in-memory store like Redis with vector search is the common choice because cache lookups must be faster and cheaper than the model call they replace. If a lookup costs more than a small-model inference, the cache is pointless. Your pick of store and index matters here; we compared the options in pgvector vs Pinecone vs Qdrant.
The similarity threshold
This is the dial that decides whether the cache helps or hurts. A cosine similarity floor around 0.8 is a frequently used starting point: above it, you serve the cached answer; below it, you call the model. Set it too low and you return a confidently wrong answer to a question that only looked similar. Set it too high and your hit rate collapses and you are back to paying full price. The threshold is workload-specific and must be tuned against real traffic, not guessed.
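One way to tune against real traffic is to label a sample of query pairs and sweep the threshold over them. The sketch below assumes you have already computed a cosine similarity for each pair and had someone mark whether one answer is correct for both queries; `sweep_thresholds` and the pair format are illustrative, not part of any library.

```python
from typing import Iterable, List, Tuple

# Each pair: (cosine similarity of two queries, True if one answer fits both).
LabeledPair = Tuple[float, bool]


def sweep_thresholds(
    pairs: List[LabeledPair], thresholds: Iterable[float]
) -> List[Tuple[float, float, float]]:
    """For each threshold, return (threshold, hit_rate, false_hit_rate).

    hit_rate: share of pairs that would be served from cache.
    false_hit_rate: share of those hits where the cached answer would be wrong.
    """
    rows = []
    for t in thresholds:
        hits = [same for score, same in pairs if score >= t]
        wrong = sum(1 for same in hits if not same)
        hit_rate = len(hits) / len(pairs) if pairs else 0.0
        false_hit_rate = wrong / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows
```

Pick the lowest threshold whose false-hit rate you can live with, then recheck it whenever the query mix or the embedding model changes.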
The numbers that justify it
Published production deployments report cache hit rates between 61.6 and 68.8 percent across query categories, which is why the cost reduction lands in the 30 to 70 percent range. On the research side, SCALM, a semantic-cache design, reports a 63 percent relative increase in hit ratio and a 77 percent reduction in token usage compared with the original GPTCache approach, largely by being smarter about what to cache and how to match it.
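To see how a hit rate turns into a cost figure, a back-of-the-envelope model helps. The sketch below assumes every query pays for embedding and lookup and only misses pay for the model call; the function names and the per-query costs you pass in are placeholders for your own numbers, not measured figures.

```python
def expected_cost_per_query(hit_rate: float, model_cost: float, lookup_cost: float) -> float:
    """Every query pays embedding + lookup; only misses pay for the model call."""
    return lookup_cost + (1.0 - hit_rate) * model_cost


def savings_fraction(hit_rate: float, model_cost: float, lookup_cost: float) -> float:
    """Fraction of the no-cache cost saved. Negative means the cache costs more than it saves."""
    return 1.0 - expected_cost_per_query(hit_rate, model_cost, lookup_cost) / model_cost


def cache_pays_off(hit_rate: float, model_cost: float, lookup_cost: float) -> bool:
    """Break-even check: lookup overhead must be smaller than the model cost avoided."""
    return lookup_cost < hit_rate * model_cost
```

Under this model the saving is roughly the hit rate minus the lookup overhead, and it applies only to the share of traffic you actually route through the cache.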
The other half of the payoff is latency. A cache hit returns in the time it takes to embed and look up a query, often tens of milliseconds, instead of waiting on full generation. For interactive features that is the difference between instant and sluggish. We dig into the response-time side in cutting LLM inference latency and time to first token.
Where it goes wrong
Semantic caching has sharp edges that a naive implementation ignores.
A fixed similarity threshold does not generalize across query types. Two questions that embed close together can still need different answers, especially when a small detail such as a date, an account, or a negation flips the intent. The fix has three parts: scope the cache by user and context, never serve personalized or permission-sensitive answers from a shared cache, and pick or fine-tune an embedding model suited to your domain, since general-purpose embeddings make weaker cache-match decisions than specialized ones.
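Scoping can be as simple as partitioning the cache so a lookup only ever sees entries from the same scope. A minimal sketch, assuming hypothetical `tenant_id`, `context_version`, and `user_id` fields that you would map to your own notion of user and context:

```python
import hashlib
import json
from typing import Optional


def cache_namespace(
    tenant_id: str, context_version: str, user_id: Optional[str] = None
) -> str:
    """Build a partition key; vector lookups are restricted to one namespace.

    Pass user_id for personalized answers so they never land in the shared partition.
    """
    scope = f"user:{user_id}" if user_id is not None else "shared"
    # JSON-encode the parts so a delimiter inside an ID cannot cause a collision.
    raw = json.dumps([tenant_id, context_version, scope])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def may_use_shared_cache(personalized: bool, permission_sensitive: bool) -> bool:
    """Only non-personalized, non-sensitive answers are eligible for the shared partition."""
    return not (personalized or permission_sensitive)
```

Including a context version in the key means a change to the system prompt or knowledge base naturally starts a fresh partition instead of serving answers generated under the old one.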
You also need an eviction and freshness policy. A cached answer about pricing, availability, or anything time-sensitive goes stale, and a stale answer erodes trust faster than a slow one. Add a time-to-live and invalidate on the events that change the underlying truth.
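Both policies fit in a small wrapper around whatever holds the cached responses. The sketch below is illustrative: `FreshnessStore`, its tag scheme, and the injectable clock are assumptions for the example, and a real store such as Redis would typically handle expiry natively.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Set


@dataclass
class Entry:
    response: str
    expires_at: float
    tags: Set[str]


class FreshnessStore:
    """Cached responses with a per-entry TTL and tag-based invalidation."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Entry] = {}

    def put(self, key: str, response: str, ttl_seconds: float,
            tags: Iterable[str] = ()) -> None:
        self._entries[key] = Entry(response, self._clock() + ttl_seconds, set(tags))

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        if self._clock() >= entry.expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return entry.response

    def invalidate_tag(self, tag: str) -> int:
        """Drop every entry carrying this tag; returns how many were removed."""
        stale = [k for k, e in self._entries.items() if tag in e.tags]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

Calling something like `invalidate_tag("pricing")` from the code path that publishes a price change keeps the cache tied to the event rather than to a guess about how long the answer stays true.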
The general rule: semantic caching belongs in front of repetitive, non-personalized, slow-changing queries. It does not belong in front of anything where a near-match must not be treated as an exact match. Knowing which requests fall on which side is the engineering judgment that makes it safe, and it pairs naturally with model routing to cut AI costs: cheap requests to a small model, repeat requests to the cache, hard requests to a frontier model.
If you want this built and tuned against your real traffic rather than a demo, it is a well-scoped task for our team. See what we build and the subscription pricing.
Frequently asked questions
How much can semantic caching save?
Reported production deployments cut LLM inference cost by 30 to 70 percent, driven by cache hit rates of roughly 62 to 69 percent. The exact saving depends on how repetitive your queries are and how well your similarity threshold is tuned.
Is semantic caching the same as my provider's prompt caching?
No. Prompt caching reuses a static prompt prefix to lower input-token cost but still calls the model. Semantic caching matches the meaning of a query and skips the model call entirely on a hit. They are complementary and can be used together.
What similarity threshold should I use?
A cosine similarity floor near 0.8 is a common starting point, but it is workload-specific. Too low returns wrong answers to merely similar questions; too high kills your hit rate. Tune it against real traffic and watch for cases where a small wording change flips the intended answer.
When should I not use semantic caching?
Avoid it for personalized, permission-sensitive, or fast-changing answers where a near match must not be treated as an exact match. Scope caches by user and context, set a time-to-live, and invalidate on events that change the underlying data.
Would you rather we just build it?
Book a free scoping call and we'll ship your production-safe AI feature this week.