
Why your LLM JSON keeps breaking in production (and the fix)

  • Telling a model to "respond in JSON" fails parsing 8 to 15 percent of the time once you run real traffic through it.
  • Schema-constrained decoding (OpenAI Structured Outputs, Anthropic tool use, Gemini schema) drops malformed JSON below 0.1 percent because the decoder cannot emit a token that breaks the schema.
  • Valid JSON is not correct JSON. A schema guarantees shape, not meaning, so you still need value validation and evals on top.
  • The cost of getting this right is small: roughly 30 to 300 extra tokens per call and a one-time schema definition.

The most common AI feature failure we get called in to fix is not a hallucination or a slow model. It is a parsing error. A team ships an LLM call that is supposed to return JSON, it works in the demo, and three weeks later the on-call engineer is staring at a stack trace because the model wrapped its answer in a markdown code fence, or added a trailing comma, or prefixed the object with "Here is the JSON you requested:". The feature is down, and the root cause is that the prompt asked for JSON instead of guaranteeing it.

This post is the playbook we use to make structured output actually reliable in production, with the tradeoffs spelled out.

"Respond in JSON" is a request, not a contract

When you put "return your answer as JSON" in a prompt and parse the raw string, you are trusting a probabilistic text generator to produce syntactically perfect output every single time. At low volume it mostly works, which is exactly why the bug ships. At scale the failure rate is measurable: in reported production figures, unconstrained JSON prompts fail to parse between 8 and 15 percent of the time.
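To make the failure modes concrete, here is a minimal sketch that feeds a strict parser the kinds of replies described above. The reply strings are hand-written examples, not captured model output.

```python
import json

# Hypothetical raw model replies, mirroring the failure modes described above.
raw_replies = [
    '{"invoice_total": 4200, "status": "paid"}',                  # clean
    '```json\n{"invoice_total": 4200, "status": "paid"}\n```',    # markdown fence
    '{"invoice_total": 4200, "status": "paid",}',                 # trailing comma
    'Here is the JSON you requested: {"invoice_total": 4200}',    # chatty prefix
]

def try_parse(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

results = [try_parse(reply) for reply in raw_replies]
failures = sum(1 for result in results if result is None)
print(f"{failures} of {len(raw_replies)} replies failed to parse")
```

Only the first reply survives. The other three each contain the right data and still break the parser.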

Each failure is expensive in three ways. You pay for the wasted generation, then you pay again for the retry, which roughly doubles the token cost of that request. The retry adds 500 to 2000 ms of latency, so the user feels it. And someone on your team has to write and maintain the brittle regex-and-fallback code that tries to salvage the broken string. That parsing layer becomes its own source of bugs.
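As a rough back-of-the-envelope sketch, the average overhead per request can be estimated from the failure rate. The function name and the 1,000-token request size are assumptions for illustration. The model also simplifies by assuming a single retry that always succeeds and costs the same as the first attempt.

```python
def expected_retry_overhead(base_tokens, failure_rate, retry_latency_ms):
    """Average extra cost per request when each parse failure triggers one retry.

    Simplification: the retry is assumed to succeed and to cost the same
    number of tokens as the first attempt.
    """
    extra_tokens = failure_rate * base_tokens
    extra_latency_ms = failure_rate * retry_latency_ms
    return extra_tokens, extra_latency_ms

# base_tokens=1000 is a made-up request size; the failure rates and retry
# latencies are the two ends of the ranges quoted in the text.
low = expected_retry_overhead(base_tokens=1000, failure_rate=0.08, retry_latency_ms=500)
high = expected_retry_overhead(base_tokens=1000, failure_rate=0.15, retry_latency_ms=2000)
print(low, high)
```

These are averages across all traffic. The requests that actually fail still pay the full second generation and the full retry latency, and none of this counts the engineering time spent on the salvage code.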

The fix is to stop asking and start constraining. Modern providers expose decoding-level guarantees that make malformed output impossible rather than improbable.

Three ways to get JSON, ranked by reliability

JSON mode (better, not enough)

JSON mode tells the API "the response must be parseable JSON." The provider biases decoding so you get a syntactically valid object. That removes the markdown-fence and trailing-comma class of bugs, but it does not enforce your field names, types, or required keys. Reported schema-mismatch rates for plain JSON mode still sit around 2 to 5 percent on the major providers, and higher on smaller models. You get valid JSON that is the wrong shape.
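A small sketch of what "valid JSON that is the wrong shape" looks like in practice. The reply and the expected fields are hypothetical, and the check is deliberately minimal.

```python
import json

# Hypothetical JSON-mode reply: it parses fine, but the key is named "total"
# instead of "invoice_total", and the amount arrived as a string.
reply = '{"total": "4200", "status": "paid"}'

# The fields and types our application expects (an assumption for this example).
REQUIRED_FIELDS = {"invoice_total": (int, float), "status": str}

def shape_problems(obj):
    """List the ways obj deviates from the expected field names and types."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in obj:
            problems.append(f"missing field: {name}")
        elif not isinstance(obj[name], expected_type):
            problems.append(f"wrong type for {name}: {type(obj[name]).__name__}")
    return problems

parsed = json.loads(reply)          # succeeds: the syntax is valid
problems = shape_problems(parsed)   # but the shape is not what we asked for
print(problems)
```

The parser raises nothing, so without a shape check this object flows straight into downstream code that expects invoice_total.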

Function and tool calling (good)

Tool calling hands the model a named function with a typed argument schema and asks it to fill in the arguments. Because the provider validates against that schema, the shape is far more reliable. Anthropic tool use reports malformed-output rates below 0.2 percent. This is the right primitive when the JSON represents an action the model is choosing to take, and it is the foundation of most agent loops. If you are building agents, apply the same discipline to tool definitions and failure handling, which we cover in hardening production AI agents against prompt injection.
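For illustration, here is roughly what a tool definition looks like in the name / description / input_schema layout used by Anthropic's Messages API. The record_invoice tool, its fields, and the sample response content are made up for this sketch. The SDK returns typed objects rather than plain dicts, so treat the extraction helper as a shape illustration and check field names against the current provider docs.

```python
# Tool definition in the name / description / input_schema layout.
# "record_invoice" and its fields are hypothetical.
record_invoice_tool = {
    "name": "record_invoice",
    "description": "Record the total and payment status extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_total": {"type": "number"},
            "status": {"type": "string", "enum": ["paid", "unpaid", "overdue"]},
        },
        "required": ["invoice_total", "status"],
    },
}

def extract_tool_input(content_blocks, tool_name):
    """Return the arguments of the first matching tool_use block, else None."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block.get("input")
    return None

# Hand-written stand-in for a response's content list, not real API output.
sample_content = [
    {"type": "text", "text": "Recording the invoice."},
    {
        "type": "tool_use",
        "id": "toolu_example",
        "name": "record_invoice",
        "input": {"invoice_total": 4200, "status": "paid"},
    },
]

args = extract_tool_input(sample_content, "record_invoice")
print(args)
```

The arguments arrive as an already-parsed object keyed by the schema's field names, so there is no raw string to salvage.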

Structured outputs with constrained decoding (best)

Structured outputs is JSON mode's disciplined older sibling. You pass an actual JSON Schema, and the provider uses constrained decoding: at each step it masks out any token that would violate the schema, so the model cannot emit a non-conforming token. Field names are correct, types are correct, and every required property is present. Reported failure rates drop below 0.1 percent. If you are shipping to production and your provider supports it, this is the one to reach for.
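The masking idea can be shown with a toy decoder. This is a sketch under heavy simplification, not how any provider implements it: real systems compile the schema into a grammar over the tokenizer's vocabulary, whereas this example uses a nine-token vocabulary, a fake scoring function that deliberately prefers schema-breaking tokens, and a schema small enough to enumerate every valid output.

```python
import json
import math

VOCAB = ['{', '}', '"status"', ':', '"paid"', '"unpaid"', '"banana"', 'Sure!', ',']

# Toy schema: {"status": "paid" | "unpaid"}. The set of valid outputs is
# finite, so we can simply enumerate the valid token sequences.
VALID_SEQUENCES = [
    ['{', '"status"', ':', '"paid"', '}'],
    ['{', '"status"', ':', '"unpaid"', '}'],
]

def allowed_next(prefix):
    """Tokens that keep the prefix on a path to some schema-valid output."""
    n = len(prefix)
    return {seq[n] for seq in VALID_SEQUENCES if seq[:n] == prefix and len(seq) > n}

def fake_scores(prefix):
    """Stand-in for model logits; deliberately favors schema-breaking tokens."""
    preferred = {'Sure!': 5.0, '"banana"': 4.0, ',': 3.0}
    return {token: preferred.get(token, 1.0) for token in VOCAB}

def constrained_decode():
    output = []
    while True:
        allowed = allowed_next(output)
        if not allowed:  # no continuation left: the object is complete
            return output
        scores = fake_scores(output)
        # The mask: disallowed tokens get -inf, so argmax can never pick them.
        masked = {t: (s if t in allowed else -math.inf) for t, s in scores.items()}
        output.append(max(masked, key=masked.get))

text = ''.join(constrained_decode())
parsed = json.loads(text)
print(text)
```

Even though the fake model "wants" to open with a chatty prefix at every step, the mask leaves it nothing to choose from except tokens that continue a valid object.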

The overhead is modest. Constrained decoding adds roughly 30 to 300 tokens of schema overhead per call and a small amount of latency for schema compilation on the first request, which most providers cache. Compared to the cost of a doubled retry and a 500 ms-plus penalty on 1 in 10 calls, the math is not close.

Valid JSON is still not correct JSON

Here is the trap teams fall into after they adopt structured outputs: they assume the parsing problem is solved, so the data must be right. It is not. Constrained decoding guarantees the shape of the object. It says nothing about whether the values inside are true, in range, or sane.

A schema can require an invoice_total field of type number. It cannot stop the model from putting 0 there when the real total is 4,200. It can require a status string. It cannot stop the model from inventing a status your system has never heard of unless you constrain it to an enum. So the schema does real work, but it is the floor, not the ceiling.
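A short sketch of that floor. The schema below is a hypothetical invoice schema, and conforms is a hand-rolled checker covering only the handful of JSON Schema keywords used here. It stands in for a real validator purely to keep the example self-contained.

```python
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_total": {"type": "number"},
        "status": {"type": "string", "enum": ["paid", "unpaid", "overdue"]},
    },
    "required": ["invoice_total", "status"],
    "additionalProperties": False,
}

JSON_TYPES = {"number": (int, float), "string": str}

def conforms(obj, schema):
    """Check obj against the small subset of JSON Schema used above."""
    if not isinstance(obj, dict):
        return False
    props = schema["properties"]
    if any(key not in obj for key in schema["required"]):
        return False
    if schema.get("additionalProperties") is False and any(k not in props for k in obj):
        return False
    for key, value in obj.items():
        rule = props.get(key)
        if rule is None:
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, JSON_TYPES[rule["type"]]):
            return False
        if "enum" in rule and value not in rule["enum"]:
            return False
    return True

wrong_total = {"invoice_total": 0, "status": "paid"}          # real total was 4,200
invented_status = {"invoice_total": 4200, "status": "settled"}

print(conforms(wrong_total, INVOICE_SCHEMA))      # shape is fine, value is wrong
print(conforms(invented_status, INVOICE_SCHEMA))  # the enum rejects this one
```

The enum catches the invented status. Nothing in the schema can catch the zero.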

Two layers belong on top of every structured call in production:

  • Deterministic validation in your own code. Re-validate with a Pydantic or Zod model, check enums against your real allowed values, bound numeric ranges, and reject anything that references an ID that does not exist in your database. This is cheap and catches the obvious nonsense.
  • Evals for the things validation cannot see. These cover whether the extracted total is actually the right number, whether the summary is faithful to the source, and whether the classification matches a human label. This is the same evidence-led approach we describe in the eval mistakes that let bad RAG ship and in turning AI feature acceptance criteria into evals.
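The first layer can be sketched in a few lines. This version uses only the standard library to stay self-contained; in a real codebase a Pydantic or Zod model would play the same role. The allowed statuses, the customer IDs, and the upper bound are all hypothetical values standing in for your own business rules and database.

```python
from dataclasses import dataclass

ALLOWED_STATUSES = {"paid", "unpaid", "overdue"}   # hypothetical business values
KNOWN_CUSTOMER_IDS = {"cus_001", "cus_002"}        # stand-in for a database lookup
MAX_INVOICE_TOTAL = 1_000_000                      # hypothetical sanity bound

@dataclass(frozen=True)
class Invoice:
    customer_id: str
    invoice_total: float
    status: str

def validate_invoice(data):
    """Return (Invoice, []) on success or (None, [errors]) on failure."""
    errors = []

    total = data.get("invoice_total")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        errors.append("invoice_total must be a number")
    elif not 0 < total <= MAX_INVOICE_TOTAL:
        errors.append("invoice_total out of range")

    status = data.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        errors.append("status not in allowed set")

    customer_id = data.get("customer_id")
    if not isinstance(customer_id, str) or customer_id not in KNOWN_CUSTOMER_IDS:
        errors.append("unknown customer_id")

    if errors:
        return None, errors
    return Invoice(customer_id, float(total), status), []

good, good_errors = validate_invoice(
    {"customer_id": "cus_001", "invoice_total": 4200, "status": "paid"}
)
bad, bad_errors = validate_invoice(
    {"customer_id": "cus_999", "invoice_total": 0, "status": "settled"}
)
print(good, bad_errors)
```

Every check here is deterministic and costs microseconds. What it cannot tell you is whether 4200 was the number printed on the invoice, which is the job of the second layer.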

If the values feeding your structured output come from retrieval, the upstream quality matters more than the schema. Garbage context produces well-formatted wrong answers, which is why we treat grounding as a first-class concern in reducing hallucinations in production RAG.

A production checklist

This is the order we apply when hardening an LLM feature that emits structured data:

  • Use schema-constrained structured outputs where the provider supports it. Fall back to tool calling, then JSON mode, in that order. Never parse a raw free-text response.
  • Keep the schema as tight as the domain allows. Use enums for known value sets, mark fields required, and avoid open string fields where a constrained type will do.
  • Re-validate the parsed object in application code. Treat the model output as untrusted input, the same way you would treat a request body from a browser.
  • Log every validation failure with the raw output. The few cases that slip through are your best eval material.
  • Watch token and latency overhead. Structured outputs add a small constant cost. If your call volume is high, fold it into the same budgeting discipline as time-to-first-token and inference latency.
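The re-validate and log steps from the checklist can be combined into one small wrapper. This is a sketch: parse_untrusted, the logger name, and the validate callback signature are assumptions, and the validator passed in would be whatever your application already uses.

```python
import json
import logging

logger = logging.getLogger("llm_output")

def parse_untrusted(raw_output, validate):
    """Parse and validate a model reply; log the raw text on any failure.

    validate is a caller-supplied function that takes the parsed object and
    returns a list of error strings (empty when the object is acceptable).
    """
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        logger.warning("parse_failure error=%s raw=%r", exc, raw_output)
        return None
    errors = validate(obj)
    if errors:
        logger.warning("validation_failure errors=%s raw=%r", errors, raw_output)
        return None
    return obj

def require_known_status(obj):
    """Example validator: the status must be one of our allowed values."""
    if isinstance(obj, dict) and obj.get("status") in ("paid", "unpaid", "overdue"):
        return []
    return ["status missing or not allowed"]

accepted = parse_untrusted('{"status": "paid"}', require_known_status)
rejected = parse_untrusted('{"status": "settled"}', require_known_status)
unparseable = parse_untrusted('Sure! {"status": "paid"}', require_known_status)
```

Keeping the raw output in the log line is the point: each rejected reply becomes a ready-made test case for the eval set.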

Done in this order, the parsing-error class of incidents disappears, and the residual failures move to the place you can actually measure them: the values, not the syntax.

Frequently asked questions

Is JSON mode the same as structured outputs?

No. JSON mode only guarantees that the response is syntactically valid JSON. Structured outputs accept a full JSON Schema and use constrained decoding to guarantee field names, types, and required properties as well. JSON mode still mismatches the intended schema 2 to 5 percent of the time on major providers; structured outputs report below 0.1 percent.

Does constrained decoding hurt answer quality?

In practice the effect is small for well-designed schemas, and the reliability gain dwarfs it. The main risk is an over-constrained schema that forces the model into a shape that does not fit the task. Keep schemas tight on values you control (enums, required keys) and leave room where the model genuinely needs it.

What do I do if my provider does not support structured outputs?

Use tool or function calling, which is widely supported and far more reliable than raw JSON mode. If only JSON mode is available, add a strict schema validator in your own code and a single bounded retry, and log every failure so you can quantify the gap.
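A minimal sketch of the JSON-mode fallback path: strict validation, one bounded retry, and a record of every failure. The call_model and validate callables are hypothetical and supplied by the caller, and the fake model below exists only to exercise the loop.

```python
import json

def call_with_one_retry(call_model, validate, prompt, max_attempts=2):
    """Call the model, validate strictly, and retry at most once.

    call_model(prompt) -> str and validate(obj) -> list of error strings are
    hypothetical callables supplied by the caller. Returns (obj_or_None,
    failures), where failures records every rejected reply for later analysis.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError as exc:
            failures.append({"attempt": attempt, "raw": raw, "error": str(exc)})
            continue
        errors = validate(obj)
        if not errors:
            return obj, failures
        failures.append({"attempt": attempt, "raw": raw, "error": "; ".join(errors)})
    return None, failures

# Fake model for demonstration: the first reply is chatty, the second is clean.
replies = iter(['Sure! {"status": "paid"}', '{"status": "paid"}'])

def fake_model(prompt):
    return next(replies)

def require_status(obj):
    if isinstance(obj, dict) and "status" in obj:
        return []
    return ["missing status"]

result, failures = call_with_one_retry(fake_model, require_status, "Extract the status.")
print(result, len(failures))
```

The hard cap on attempts matters. An unbounded retry loop turns a formatting problem into a latency and cost problem, and the failures list is what lets you measure how often the fallback path is actually hit.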

How much extra does this cost?

Roughly 30 to 300 tokens of schema overhead per call plus a one-time schema-compilation latency that most providers cache. That is far cheaper than retrying 1 in 10 calls, where each retry doubles the token cost and adds 500 ms or more of latency.
