The invoice landed on a Tuesday. $48,211. That was the OpenAI line item for a single month — and it had grown 34% in the previous quarter. The product team had shipped three new AI features. The CEO called it a success. The CFO called it a problem. They were both right.
This is the story of what happened next: how a B2B SaaS company doing $2.1M in ARR found itself spending nearly $580K annualized on LLM inference, and how we, the engineers brought in to fix it, cut the monthly bill from $48,211 to $19,400 in six weeks without a single feature rollback. Everything here is real. The numbers are real. The tradeoffs are real. If you are staring at a bill that feels out of control, read every section.
Why the Bill Got So Big
Before we optimized anything, we audited everything. Most teams skip this step and jump straight to "use a cheaper model." That is how you create new bugs while not actually solving the problem.
The company was running five LLM-powered features: a customer support chatbot, a contract summarization pipeline, an internal Q&A tool over documentation, a proposal draft generator, and an email classification system. All five were pointed at GPT-4o. All five used full-context prompts. None used caching. None used routing. Each feature had been built independently by different engineers, and nobody had ever sat down and asked: "Does this task actually need a frontier model?"
The three core cost drivers were token bloat, no caching, and single-model over-reliance — and they compounded each other. A single contract summarization call was burning 14,000 input tokens because the full contract was injected into the prompt every time, including the same 3,200-token system prompt repeated on every request. At GPT-4o pricing, that was $0.042 per call. At 80,000 calls per month for that one feature alone, you get to $3,360 monthly — from one feature running one model on one task.
Multiply this pattern across five features and you have a $48K invoice.
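To make that arithmetic easy to reproduce for your own features, here is a minimal cost-estimate sketch. The list prices and the roughly 700-token output assumption are ours, not the company's; check current pricing before relying on the numbers.

```python
# Back-of-the-envelope per-call cost (assumed list prices; verify current rates)
GPT4O_INPUT_PER_TOKEN = 2.50 / 1_000_000    # assumed: $2.50 per 1M input tokens
GPT4O_OUTPUT_PER_TOKEN = 10.00 / 1_000_000  # assumed: $10.00 per 1M output tokens

def estimate_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single chat completion at the assumed prices."""
    return input_tokens * GPT4O_INPUT_PER_TOKEN + output_tokens * GPT4O_OUTPUT_PER_TOKEN

# 14,000 input tokens plus a ~700-token summary lands near $0.042 per call
per_call = estimate_call_cost(input_tokens=14_000, output_tokens=700)
print(f"per call: ${per_call:.3f}  |  80,000 calls/month: ${per_call * 80_000:,.0f}")
```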
Step 1 — We Measured Before We Touched Anything
The first week was pure observability, no code changes. We instrumented every LLM call with token counts, model used, latency, and task type. We used LangSmith for tracing and pushed the data into a simple Postgres table we queried daily.
Three numbers came out of that week that changed how the team thought about the problem:
- 62% of all tokens were being spent on three tasks: contract summarization, documentation Q&A, and proposal generation
- 41% of all calls were near-duplicate requests — the same or very similar inputs being sent within a 24-hour window
- The email classification feature was using GPT-4o for binary "urgent / not urgent" decisions — a task that a $0.00015/1K-token model handles at 98% accuracy
Here is the lightweight tracing wrapper we used to tag every outbound API call:
import time

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def traced_completion(client, messages, model, task_type):
    """Wrap every LLM call with token + latency tracking."""
    # Approximate input tokens locally so the log row is complete even if the call fails
    input_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    start = time.monotonic()
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    elapsed_ms = (time.monotonic() - start) * 1000
    output_tokens = response.usage.completion_tokens
    # usage is a typed object, not a dict; cached tokens live under prompt_tokens_details
    details = getattr(response.usage, "prompt_tokens_details", None)
    cached_tokens = (details.cached_tokens or 0) if details else 0
    # Log to your observability table
    log_llm_call(
        model=model,
        task_type=task_type,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        latency_ms=round(elapsed_ms),
        cached_tokens=cached_tokens
    )
    return response
Notice we tag task_type on every call. Without this label, your cost data is a single number — useless for deciding which feature to optimize first. With it, you get a per-feature cost breakdown in a single SQL query. This took 20 minutes to implement and it drove every decision that followed.
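The log_llm_call helper referenced above is not shown in the post; one minimal way to back it with Postgres looks like this. The llm_calls table name, the connection string, and the psycopg dependency are our assumptions, not the team's actual schema.

```python
import psycopg  # assumes psycopg 3; table and connection details are illustrative

def log_llm_call(model, task_type, input_tokens, output_tokens, latency_ms, cached_tokens):
    """Insert one row per LLM call so spend can be grouped by feature."""
    with psycopg.connect("dbname=llm_observability") as conn:
        conn.execute(
            "INSERT INTO llm_calls "
            "(model, task_type, input_tokens, output_tokens, latency_ms, cached_tokens, created_at) "
            "VALUES (%s, %s, %s, %s, %s, %s, now())",
            (model, task_type, input_tokens, output_tokens, latency_ms, cached_tokens),
        )
```

The per-feature breakdown is then a single GROUP BY task_type query over llm_calls, summing input and output tokens per feature.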
You cannot optimize what you cannot see. This week of measurement paid back in hours once we started making changes. If your team is considering any of the strategies below, build your instrumentation first. You will make better decisions with data than with intuition.
Step 2 — Prompt Caching Cut 38% of the Bill Overnight
This was the fastest win, and it required almost no engineering. Both OpenAI and Anthropic now offer prompt caching. On OpenAI, cached input tokens cost 50% less and caching activates automatically for prompts above a minimum length (currently 1,024 tokens), with no code changes needed. On Anthropic's Claude, cache reads cost 90% less than regular input tokens, though you mark the cacheable prefix explicitly with cache_control.
The company's system prompts were between 2,000 and 4,500 tokens each. Moving the static content — instructions, context, examples — to the beginning of every prompt and keeping the variable content at the end is all it takes to trigger caching. We audited each prompt, restructured the static prefix, and redeployed.
Here is the before-and-after prompt structure that made this work:
# BEFORE: Variable content mixed everywhere — no caching
messages = [
    {"role": "system", "content": f"Summarize this contract for {client_name}. "
                                  f"Focus on: payment terms, liability, termination. "
                                  f"Contract text: {contract_text}"}
]

# AFTER: Static prefix first, variable content last — cache hits
messages = [
    {"role": "system", "content": (
        "You are a contract analysis assistant. "
        "Extract and summarize: payment terms, liability clauses, "
        "termination conditions, renewal terms, and indemnification. "
        "Output as structured JSON with these exact keys: "
        "payment_terms, liability, termination, renewal, indemnification. "
        "Keep each summary under 100 words."
    )},
    {"role": "user", "content": f"Contract for {client_name}:\n\n{contract_text}"}
]
The system message is now a stable prefix that caches after the first request. Within 48 hours, we could see in LangSmith that cached token reads were accounting for 55–70% of all input tokens across the five features.
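If you are not using LangSmith, the same cached-token share can be derived from the Step 1 tracing rows; a minimal sketch, assuming each row carries input_tokens and cached_tokens.

```python
def cached_token_share(rows) -> float:
    """rows: iterable of (input_tokens, cached_tokens) pulled from the tracing table."""
    total_input = sum(r[0] for r in rows)
    total_cached = sum(r[1] for r in rows)
    return total_cached / total_input if total_input else 0.0

# e.g. 0.55-0.70 once the static prefixes are cache-friendly
```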
That single change took the monthly bill from $48,211 to roughly $30,000. It took two engineers one day. The math is that simple: if your prompts repeat static content on every call, you are paying full price for content the API has already processed. Prompt caching is not a trick — it is the baseline you should be at before doing anything else.
Step 3 — Model Routing Dropped Another $7K Per Month
With caching in place and real task-level data in hand, we built a two-tier routing layer. The idea is straightforward: not every task needs the most capable model. Complexity-based routing achieves 10–30% cost reduction while maintaining accuracy at the simple tier, and cascading approaches where 80% of queries resolve at the budget tier can cut costs by 65–80%.
We classified our five features into three tiers:
| Feature | Old Model | New Model | Reason |
|---|---|---|---|
| Email classification | GPT-4o | GPT-4o-mini | Binary task, high volume |
| Documentation Q&A | GPT-4o | Claude Haiku | Short, factual retrieval |
| Contract summarization | GPT-4o | GPT-4o | Stays — needs reasoning |
| Proposal draft generator | GPT-4o | GPT-4o | Stays — nuanced output |
| Customer support chatbot | GPT-4o | Tiered router | 68% FAQ → mini, rest → 4o |
The customer support chatbot was the most interesting case. Roughly 68% of support queries were FAQ-type: "How do I reset my password?", "Where is my invoice?", "What's the refund policy?" These did not need GPT-4o. We built a simple classifier that routes FAQ-pattern queries to GPT-4o-mini and escalates anything scored as complex or edge-case to GPT-4o.
Here is the routing logic — it is embarrassingly simple:
from openai import OpenAI

client = OpenAI()

CLASSIFIER_PROMPT = (
    "Classify this support query as SIMPLE or COMPLEX. "
    "SIMPLE = password reset, invoice lookup, refund policy, "
    "account settings, billing FAQ. "
    "COMPLEX = bug report, integration issue, feature request, "
    "data migration, custom config. "
    "Respond with one word only."
)

def route_query(user_query: str) -> str:
    """Route to cheap or expensive model based on complexity."""
    classification = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": CLASSIFIER_PROMPT},
            {"role": "user", "content": user_query}
        ],
        max_tokens=5
    )
    # Normalize casing so "Simple" still routes to the cheap tier
    label = classification.choices[0].message.content.strip().upper()
    return "gpt-4o-mini" if label == "SIMPLE" else "gpt-4o"
The classifier itself runs on GPT-4o-mini — the cost of classification is negligible compared to the savings on inference. GPT-4o-mini costs roughly 15–20x less per token than GPT-4o on comparable tasks. Shifting 68% of chatbot volume to the mini tier cut chatbot costs by 62%. Combined with the documentation Q&A migration to Haiku, this step removed another $7,200 from the monthly bill.
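For completeness, here is how the router might sit in front of the actual chatbot completion. The answer_query wrapper and its system prompt are our sketch, not the production code.

```python
def answer_query(user_query: str) -> str:
    """Classify first, then answer with whichever model the router picked."""
    model = route_query(user_query)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are the customer support assistant."},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content
```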
If this is research for a task on your roadmap — we ship features like this in 5–7 days.
See pricing →
Step 4 — Semantic Caching Handled the Duplicate Problem
Remember that 41% of all calls were near-duplicate requests. Standard HTTP caching handles exact duplicates, but LLM queries are text — "What's your return policy?" and "Can I return this?" are semantically identical but string-different. Semantic caching solves this by embedding incoming queries, comparing them against a vector store of recent queries, and returning cached responses for matches above a similarity threshold.
We implemented this using Redis with a vector similarity layer. Here is the core logic:
import json

import numpy as np
from openai import OpenAI
from redis import Redis

client = OpenAI()
redis = Redis(host="localhost", port=6379, db=0)
SIMILARITY_THRESHOLD = 0.92

def get_embedding(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return resp.data[0].embedding

def cosine_sim(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cache_lookup(query: str) -> str | None:
    """Check if a semantically similar query was already answered."""
    query_emb = get_embedding(query)
    # Linear scan is fine at this cache size; swap in Redis vector search if it grows
    cached_keys = redis.keys("sem_cache:*")
    for key in cached_keys:
        cached = json.loads(redis.get(key))
        sim = cosine_sim(query_emb, cached["embedding"])
        if sim >= SIMILARITY_THRESHOLD:
            return cached["response"]
    return None
Queries are embedded with a lightweight model and compared against cached embeddings; if the cosine similarity exceeds 0.92, the cached response is returned. Calls that hit the semantic cache spend zero LLM tokens, only the negligible cost of one embedding call.
The cache hit rate stabilized at 34% after two weeks of warm-up, meaning one in three LLM calls was being served from cache entirely. On the support chatbot alone — which generates the highest query volume — this cut monthly costs by another $2,100. The infrastructure cost of running the semantic cache (Redis, embedding model) was $180/month.
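The write path is not shown above; one minimal way to populate the cache after a fresh LLM answer, assuming a 24-hour TTL (our choice, matching the duplicate window) and the same sem_cache key scheme.

```python
import json
import uuid

CACHE_TTL_SECONDS = 24 * 3600  # assumed TTL, matching the 24-hour duplicate window

def semantic_cache_store(query: str, response_text: str) -> None:
    """Store the query embedding and response so similar future queries can hit."""
    entry = {"embedding": get_embedding(query), "response": response_text}
    redis.set(f"sem_cache:{uuid.uuid4().hex}", json.dumps(entry), ex=CACHE_TTL_SECONDS)
```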
One more fix worth calling out from the routing audit: the email classifier had been using GPT-4o to decide between two labels. We replaced it with a small fine-tuned model at 1/40th the cost. That is not optimization; that is fixing a mistake.
Step 5 — Context Window Trimming for the RAG Pipeline
The documentation Q&A feature used a RAG architecture, but it was poorly tuned. It was retrieving the top-10 document chunks per query and injecting all 10 into the prompt context — regardless of relevance. Average prompt size for a Q&A call was 9,400 tokens. After auditing 500 logged responses, we found that the correct answer came from chunk 1 or 2 in 87% of cases. Chunks 6–10 were essentially noise.
We added a reranking step using a lightweight cross-encoder model to reorder retrieved chunks by relevance, then hard-capped injection at the top 3 chunks. Here is the reranker integration:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
TOP_K = 3  # Hard cap — was 10 before

def rerank_and_trim(query: str, chunks: list[str]) -> list[str]:
    """Rerank retrieved chunks and keep only top-K."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(
        zip(scores, chunks),
        key=lambda x: x[0],
        reverse=True
    )
    return [chunk for _, chunk in ranked[:TOP_K]]
Average prompt size dropped from 9,400 to 3,100 tokens — a 67% reduction in input tokens for that pipeline. Combined with the model switch to Claude Haiku, this moved the documentation Q&A feature from $4,800/month to $890/month. Accuracy on our held-out test set was unchanged — because the discarded chunks were not contributing to correct answers in the first place.
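Downstream of the reranker, the trimmed chunks feed a much smaller prompt. Here is a sketch of how the Q&A call might be assembled after trimming; the Anthropic client usage is standard, but the model id and prompt wording are our assumptions.

```python
import anthropic

anthropic_client = anthropic.Anthropic()

def answer_from_docs(query: str, retrieved_chunks: list[str]) -> str:
    """Rerank, keep the top-K chunks, and answer with the cheaper model."""
    context = "\n\n".join(rerank_and_trim(query, retrieved_chunks))
    response = anthropic_client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed model id; use your Haiku tier
        max_tokens=500,
        system="Answer strictly from the provided documentation excerpts.",
        messages=[{"role": "user", "content": f"Docs:\n{context}\n\nQuestion: {query}"}],
    )
    return response.content[0].text
```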
Prompt compression is consistently one of the highest-leverage techniques available, with documented savings of 10–30% from context reduction alone. When you stack it with a model swap and caching, the compounding effect is significant.
What to Do This Week
Six weeks after our first commit, the bill read $19,400. From $48,211 to $19,400 — a 60% reduction, $344K saved annually, and not a single user-facing regression. Here is the sequenced playbook:
- Instrument first. Add token-level tracing before you change anything. LangSmith, Helicone, and Weights & Biases all work. Without this, you are guessing.
- Enable prompt caching immediately. Restructure system prompts so static content leads. This is a one-day change with a 30–50% impact on bills for prompt-heavy workloads.
- Audit model-to-task fit. Pull your task list. Ask: does each task actually require a frontier model? Email classification, simple Q&A, entity extraction, and intent labeling almost never do.
- Build a two-tier router. A lightweight classifier that sends easy queries to a cheap model and hard queries to a capable one consistently achieves 60–80% cost reduction on high-volume chatbot features.
- Add semantic caching. For any feature with repetitive user queries, a vector similarity cache with a 0.92+ threshold will hit 25–40% of requests after two weeks of warm-up.
- Trim RAG context windows. Add a reranking step. Cap injected chunks at 3–5. Measure accuracy before and after — in our case it was unchanged, because the discarded chunks were not contributing.
One honest tradeoff to name: semantic caching introduces a cache invalidation problem. When your product changes — new pricing, new features, changed policies — you need to flush the relevant cache segments. Build a cache invalidation API on day one, before you need it.
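That invalidation API can start as a prefix-scoped delete. A sketch, assuming the sem_cache keys from Step 4 are namespaced per feature (the feature segment in the key is our addition).

```python
def invalidate_semantic_cache(feature: str) -> int:
    """Flush cached answers for one feature, e.g. after a pricing or policy change."""
    # Assumes keys are written as sem_cache:<feature>:<id>; adapt to your own scheme
    deleted = 0
    for key in redis.scan_iter(f"sem_cache:{feature}:*"):
        redis.delete(key)
        deleted += 1
    return deleted
```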
The bill you are staring at is not a tax on using AI. It is a tax on not yet having built the optimization layer. The techniques above are not experimental — they are standard production practice in 2026, and the savings are repeatable.
Got an AI feature with a cost problem?
Book a free 20-minute AI Feature Scoping Call. We'll tell you whether Boundev is the right fit, what tier you'd need, and how fast we can ship. We say no to about a third of calls — the fit either works or it doesn't.
Book scoping call →
Frequently Asked Questions
How quickly can I expect results after enabling prompt caching?
Results are visible within 24–48 hours. OpenAI activates caching automatically once your prompt structure places static content at the prefix and the prompt clears the minimum length; on Anthropic's Claude you mark that static prefix with cache_control breakpoints. Most teams see a 30–50% reduction in input token costs within the first two days.
Will routing queries to cheaper models hurt output quality?
For the right tasks, no. Classification, intent detection, simple Q&A, and entity extraction tasks show minimal quality degradation when moved to models like GPT-4o-mini or Claude Haiku. The risk is over-routing — sending reasoning-heavy or creative tasks to budget models. Always benchmark accuracy on a held-out test set before routing goes live.
What is semantic caching and how is it different from standard caching?
Standard caching returns a hit only on exact string matches. Semantic caching embeds queries into a vector space and returns a hit when an incoming query is sufficiently similar (typically 0.92+ cosine similarity) to a previously answered one. This handles natural language variation — "refund policy" and "how do I get my money back?" resolve to the same cached answer.
At what usage volume does self-hosting become worth considering?
Self-hosting open-source models (Llama 3, Mistral) typically breaks even against API pricing at sustained volumes above 2–3 million tokens per day, depending on GPU type and model size. Below that threshold, the infrastructure overhead — GPU provisioning, model serving, monitoring — costs more than the API savings.
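If you want to sanity-check that break-even against your own numbers, the comparison is one line; the inputs are yours to fill in, since GPU and API prices vary widely.

```python
def self_hosting_breaks_even(tokens_per_day: float,
                             api_price_per_million: float,
                             serving_cost_per_day: float) -> bool:
    """True when daily API spend exceeds the all-in daily cost of self-hosting."""
    return tokens_per_day / 1_000_000 * api_price_per_million > serving_cost_per_day
```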
Do these optimizations require significant engineering resources?
Prompt caching requires one day of restructuring. Model routing requires 1–2 weeks including classifier and routing logic. Semantic caching requires 1–2 weeks including Redis and embedding integration. RAG trimming with reranking is another 1–2 weeks. The full stack described here took two engineers six weeks — and delivered $344K in annualized savings.
