Cut your LLM bill in half with the Batch API
If a chunk of your LLM spend goes to work that nobody is waiting on in real time, you are probably overpaying by exactly 50 percent. Both OpenAI and Anthropic price their batch endpoints at half the standard per-token rate, and most teams never route a single request through them. This is the rare cost lever that needs no prompt rewrite, no model downgrade, and no quality tradeoff.
Here is what the batch discount actually is, which of your workloads qualify today, and how to ship it without breaking the parts of your product that genuinely need a fast response.
What the batch discount really is
A batch (or message batches) API takes a file of requests, queues them, and returns the results within a completion window instead of streaming a reply back in a few hundred milliseconds. The window is up to 24 hours, though in practice most batches finish far sooner. In exchange for giving up immediacy, you pay half price on both input and output tokens.
The math is blunt. If you spend 5,000 dollars a month running a model synchronously and you move the eligible portion of that traffic to the batch endpoint, that portion drops to half its cost with the same model and the same prompts. No accuracy regression, because nothing about the inference changed except when it ran.
The discount also stacks. If your batched requests share a common prefix, the cached prefix earns both the batch discount and the prompt-caching discount on top of it. We covered the caching half of that equation in our breakdown of prompt caching for LLM cost savings, and the two combine cleanly.
Which of your workloads qualify
The test is simple: does a human need this answer in the next few seconds? If not, it can probably go through the batch endpoint. In most SaaS products, a surprising share of token spend fails the real-time test.
Offline evals and regression suites
Every time you run an eval set against a new prompt or model, those calls are pure offline work. Nobody is staring at a spinner. Running an eval harness through batch halves the cost of the experiment loop, which matters because a serious team runs evals constantly. If you have not built that loop yet, our piece on observability versus evals explains where it fits.
Bulk document and data processing
Summarizing a backlog of support tickets, classifying a CRM export, enriching a product catalog, generating descriptions across a dataset, or labeling training data are all batch-shaped. These pipelines are frequently the single largest line item in an LLM bill, and they are the easiest to move.
Scheduled and async generation
Nightly digests, weekly report generation, pre-computed recommendations, and content pipelines that publish on a schedule all run on a clock you control. If your job already runs at 2am, it does not care whether the model answers in 200ms or 20 minutes.
Industry estimates put this kind of deferrable work at 20 to 40 percent of total LLM spend for a typical application. Cutting that slice in half is a real number on the invoice, not a rounding error.
What stays on the synchronous path
Batch is the wrong tool for anything a user is actively waiting on. A chat reply, an inline autocomplete, a copilot suggestion, or a search result has a latency budget measured in hundreds of milliseconds, and a 24-hour completion window obviously violates it. For those flows, latency is the product, and you should optimize it directly rather than reach for the batch discount. Our notes on inference latency and time to first token cover that side.
The practical architecture is a split. Interactive traffic stays synchronous and gets latency tuning. Deferrable traffic gets routed to batch. The same model can serve both; only the endpoint and the dispatch logic differ.
How to ship it without a rewrite
Find the deferrable spend first
You cannot move what you cannot see. Before you write any batch code, tag your LLM calls by feature so you know which workloads are real-time and which are not. A request that runs in a cron job or a queue worker is a strong candidate; a request inside an HTTP handler that renders a user-facing response is not.
Build a dispatcher, not a refactor
The change is usually small. Collect deferrable requests into a file, submit the batch, poll for completion, and write results back to your datastore. The prompts and model parameters stay identical. You are changing the transport, not the inference, which is why this rarely touches your core application logic.
Handle the failure modes
Batches can partially fail, and individual requests inside a batch can error while others succeed. Your dispatcher needs to read per-request status, retry the failures, and avoid double-charging by re-submitting only what did not complete. Build idempotency into the write-back step so a retried batch does not duplicate rows.
Measure the before and after
Capture the per-feature cost before you move a workload and again after. This is the same discipline that took one of our engagements from a five-figure monthly bill down by more than half, documented in how we cut an LLM bill from 48k to 19k. Batch was one lever among several; the point is that each lever is only credible if you can show the number it moved.
Where batch fits in the wider cost stack
Batch is one of a handful of levers, and it composes with the others rather than competing with them. Routing cheaper traffic to cheaper models, which we cover in model routing to cut AI costs, decides which model handles a request. Batch decides when that request runs and at what rate. You can and should do both: route a bulk classification job to a small model and run it through the batch endpoint, and the discounts multiply.
The order we usually apply these in: tag spend by feature, route by model, cache shared prefixes, then batch everything deferrable. Each step is independent, each is measurable, and none of them degrade output quality.
Frequently asked questions
Does the batch API use a different or weaker model?
No. You select the same model you would call synchronously. The discount is for accepting a delayed completion window, not for a lower-quality model, so output quality is unchanged.
How long do batches actually take to complete?
The published ceiling is up to 24 hours, but most batches finish well inside that window, often in minutes to a couple of hours depending on size and current load. Design for the 24-hour worst case and treat faster completion as upside.
Can I combine batch with prompt caching?
Yes. When batched requests share a common prefix, the cached portion gets both the batch discount and the caching discount, which is the cheapest way to run repetitive large-context jobs.
What is the smallest workload worth moving to batch?
Any recurring offline job large enough that you notice it on the bill. Evals, bulk classification, and scheduled generation are the usual first wins because they are both sizeable and entirely non-interactive.
Will batching slow down my product for users?
Only if you batch the wrong thing. Keep every user-facing, latency-sensitive request on the synchronous path and route only deferrable, no-one-is-waiting work to batch. Done correctly, users see no change and the bill drops.
Rather we just build it?
Book a free scoping call and we'll ship your production-safe AI feature this week.