LLM Structured Outputs in Production SaaS

If your SaaS feature parses an LLM response with a regex and a prayer, you already know the failure mode: it works in the demo, then a customer hits an edge case and the parse throws at 2am. Plain JSON-mode prompting fails to produce schema-valid output 8 to 15 percent of the time. At 50,000 calls a day that is 4,000 to 7,500 broken responses, each one a retry, a logged error, or a blank screen for a user.

This is a solved problem in 2026, but only if you pick the right level of enforcement and stop confusing valid JSON with correct JSON. Here is how we wire structured outputs into production SaaS features so the parse never throws and the data is actually usable.

The three levels of structure enforcement

There are three distinct ways to get structured data out of an LLM, and they are not interchangeable. Most teams reach for level one, get burned, and never learn that levels two and three exist.

Level 1: prompt engineering (80 to 95 percent reliable)

You ask the model to respond only with JSON matching a shape and hope. It works most of the time, which is exactly the problem: it works often enough to ship and fails often enough to page you. There is no guarantee. The model can wrap the JSON in a markdown fence, add a chatty preamble, or hallucinate a field. Fine for a prototype, unacceptable for a feature in a paid product.
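To make those failure modes concrete, here is a minimal sketch of the tolerant parser teams usually end up writing around prompt-only JSON. The function name and the sample replies are hypothetical, and this illustrates the workaround rather than recommending it.

```python
import json
import re

# Built with "`" * 3 so the pattern matches a markdown fence without embedding one here.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_prompted_json(raw: str) -> dict:
    """Best-effort parse of a prompt-only reply; raises ValueError if no object is found."""
    fenced = FENCED_BLOCK.search(raw)
    candidate = fenced.group(1) if fenced else raw
    # Drop any chatty preamble or trailer around the outermost braces.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    try:
        return json.loads(candidate[start : end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply was not valid JSON: {exc}") from exc
```

Even when this succeeds it only proves the reply was JSON. A hallucinated extra field passes straight through, and a reply with no object at all still ends in an exception you have to handle.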

Level 2: function calling and tool use (95 to 99 percent reliable)

You define a tool with a JSON schema and the model fills in the arguments. This is a strong hint to the model, not a hard constraint, so it typically lands in the 95 to 99 percent range, though the figure cited for Anthropic tool use is higher, at around 99.8 percent schema compliance in practice. That is good enough for many internal flows, but still not something you want fronting a customer-facing write path without a validator behind it.
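As a rough illustration, a tool definition is just a name, a description, and a JSON schema for the arguments. The sketch below uses the `name` / `description` / `input_schema` layout that Anthropic's tool use API expects; other providers nest the schema under different keys. The ticket classifier, its fields, and the helper function are hypothetical.

```python
# Hypothetical tool for a support-ticket classifier; names and fields are illustrative.
CLASSIFY_TICKET_TOOL = {
    "name": "classify_ticket",
    "description": "Record the category and urgency of a support ticket.",
    "input_schema": {
        "type": "object",
        "properties": {
            "category": {
                "type": "string",
                "enum": ["billing", "bug", "feature_request", "other"],
            },
            "urgency": {"type": "integer", "minimum": 1, "maximum": 5},
        },
        "required": ["category", "urgency"],
        "additionalProperties": False,
    },
}


def argument_problems(arguments: dict) -> list[str]:
    """List required keys the model omitted and keys the schema never declared."""
    schema = CLASSIFY_TICKET_TOOL["input_schema"]
    missing = [key for key in schema["required"] if key not in arguments]
    unexpected = sorted(set(arguments) - set(schema["properties"]))
    return [f"missing: {key}" for key in missing] + [f"unexpected: {key}" for key in unexpected]
```

Because the schema is a hint at this level, a check like `argument_problems` (or a full validator) still belongs behind the call.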

Level 3: native structured output (100 percent schema-valid)

Constrained decoding masks invalid tokens during generation using a finite state machine built from your schema, so in principle the model cannot emit a token that breaks the structure. Reported figures sit just under that ceiling: OpenAI Structured Outputs reports 99.9 percent compliance, and Gemini schema mode around 99.7 percent. This is the level you want for anything that writes to a database, drives a UI, or feeds another service. The cost is 30 to 300 extra tokens per call for the schema, which is far cheaper than the retries it eliminates.
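The masking idea can be shown with a toy example. This is a deliberately simplified sketch: a real decoder compiles the schema into a state machine over the tokenizer's vocabulary and masks logits, whereas here a prefix check against two allowed outputs stands in for that machine. The vocabulary, the preference scores, and the function names are all made up for illustration.

```python
import json

# The only outputs our toy "schema" permits: an object with an enum-valued label.
VALID_OUTPUTS = ['{"label":"spam"}', '{"label":"ham"}']

# A tiny made-up vocabulary, including tokens that would break the structure.
VOCAB = ['{"label":"', "spam", "ham", "eggs", '"}', "Sure! "]

# Stand-in for model scores: left alone, this "model" prefers a chatty preamble.
MODEL_PREFERENCE = {"Sure! ": 5.0, "eggs": 4.0, "ham": 3.0, "spam": 2.0, '{"label":"': 1.0, '"}': 0.0}


def allowed_tokens(prefix: str) -> list[str]:
    """Tokens that keep the output a prefix of at least one valid string."""
    return [t for t in VOCAB if any(v.startswith(prefix + t) for v in VALID_OUTPUTS)]


def constrained_decode() -> str:
    """Greedy decoding where disallowed tokens are masked out at every step."""
    out = ""
    while out not in VALID_OUTPUTS:
        candidates = allowed_tokens(out)
        if not candidates:
            raise RuntimeError("vocabulary cannot complete any valid output")
        out += max(candidates, key=MODEL_PREFERENCE.get)
    return out


unconstrained_first_token = max(VOCAB, key=MODEL_PREFERENCE.get)
result = constrained_decode()
```

Unconstrained, the highest-scoring first token is the preamble. With the mask applied, the preamble and the out-of-enum value are never candidates, so the result always parses and always carries an allowed label.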

Schema-valid is not the same as correct

Here is the trap that bites teams who think structured output finished the job. Constrained decoding guarantees the shape. It does not guarantee the content. Your output can be perfectly valid JSON and completely wrong: a confidence score of 0.95 on a hallucinated answer, a category that does not exist in your taxonomy, a date in the future for a past event.

We treat the schema as a contract for shape and a separate validation layer as the contract for meaning. After the model returns schema-valid JSON, run cheap deterministic checks: is the enum value in your allowed set, is the referenced ID present in your database, does the number fall in a sane range. These are ordinary application-code assertions, not more LLM calls. They catch the semantic failures that schema enforcement cannot. This is the same discipline we describe in writing acceptance criteria and evals for AI features: define what correct means before you ship, then enforce it in code.
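A minimal sketch of such a layer, assuming a hypothetical record with `category`, `account_id`, `confidence`, and `event_date` fields. The in-memory set of account IDs stands in for a database lookup, and the taxonomy is invented for the example.

```python
from datetime import date

# Hypothetical taxonomy; in a real system this comes from your own configuration.
ALLOWED_CATEGORIES = {"billing", "bug", "feature_request"}


def semantic_errors(record: dict, known_account_ids: set[str], today: date) -> list[str]:
    """Deterministic meaning checks to run after the output is already schema-valid."""
    errors = []
    if record["category"] not in ALLOWED_CATEGORIES:
        errors.append(f"unknown category: {record['category']}")
    # Stand-in for a database existence check on the referenced ID.
    if record["account_id"] not in known_account_ids:
        errors.append(f"unknown account_id: {record['account_id']}")
    if not 0.0 <= record["confidence"] <= 1.0:
        errors.append(f"confidence out of range: {record['confidence']}")
    # A past event cannot carry a future date.
    if date.fromisoformat(record["event_date"]) > today:
        errors.append(f"event_date is in the future: {record['event_date']}")
    return errors
```

Note what these checks cannot do: a confidence of 0.95 passes the range check whether or not the answer is hallucinated. Range and membership checks catch impossible values, not plausible wrong ones.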

A production checklist for structured outputs

This is the setup we ship for SaaS teams adding an extraction, classification, or routing feature on top of an LLM.

Design the schema for the consumer, not the model

Keep fields flat where you can, use enums instead of free-text strings for anything you will branch on, and mark optional fields as nullable rather than omitting them. A nullable field the model fills with null is easier to handle than a missing key your code did not expect. Avoid deeply nested objects; every level of nesting is another place the model can drift.
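As an illustration, here is what those rules look like in a JSON Schema written as a Python dict. The ticket fields and the consumer function are hypothetical; the point is the shape: one level deep, enums for anything branched on, and an optional value expressed as a required key that may be null.

```python
# Hypothetical schema for a ticket-routing feature.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        # Enums for anything the application branches on.
        "category": {"type": "string", "enum": ["billing", "bug", "feature_request", "other"]},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        # Optional in meaning, but always present as a key: null when unknown.
        "order_id": {"type": ["string", "null"]},
        "summary": {"type": "string"},
    },
    "required": ["category", "priority", "order_id", "summary"],
    "additionalProperties": False,
}


def route(record: dict) -> str:
    """Consumer code can index keys directly because none are ever omitted."""
    if record["order_id"] is None:
        return f"{record['category']}:no-order"
    return f"{record['category']}:{record['order_id']}"
```

Because `order_id` is required but nullable, `route` needs one `is None` branch rather than a defensive `.get()` on every access.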

Validate, then decide what a failure means

Wrap every structured call in a validator (Pydantic in Python, Zod in TypeScript) even when you use level-three enforcement, because providers ship bugs and schemas drift. On a validation failure, decide in advance: retry once with the error message fed back, fall back to a safe default, or surface a graceful could-not-process state. Never let a parse exception reach the user. When the LLM API itself fails, you need a separate resilience layer, which we cover in keeping an AI feature up when the LLM API fails.
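The policy can be sketched in a few lines. This version uses only the standard library so it runs anywhere; in practice a Pydantic model would replace the hand-written `validate`. The `call_model` callable, the category set, and the safe default are assumptions made for the example, not a real client API.

```python
import json
from typing import Callable

ALLOWED_CATEGORIES = {"billing", "bug", "feature_request", "other"}  # hypothetical
SAFE_DEFAULT = {"category": "other", "needs_review": True}  # hypothetical fallback


def validate(raw: str) -> dict:
    """Parse and check one reply; raises ValueError (JSONDecodeError is a subclass)."""
    data = json.loads(raw)
    if not isinstance(data, dict) or data.get("category") not in ALLOWED_CATEGORIES:
        raise ValueError(f"category must be one of {sorted(ALLOWED_CATEGORIES)}")
    return data


def classify(prompt: str, call_model: Callable[[str], str], max_retries: int = 1) -> dict:
    """Validate the reply, retry with the error fed back, then fall back to a default."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            return validate(raw)
        except ValueError as exc:
            # Feed the validation error back so the retry has something to correct.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {exc}. "
                "Reply again with valid JSON only."
            )
    # No exception escapes: the caller always receives a usable record.
    return dict(SAFE_DEFAULT)
```

The `needs_review` flag on the default is one way to surface a could-not-process state downstream without raising.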

Log the raw output alongside the parsed result

When something looks wrong in production you want the exact bytes the model returned, not just your parsed view. Store the raw response for a retention window. This is what lets you debug why a record got categorized as churn-risk without guessing. It also feeds the evals you run post-launch, which we walk through in what to measure after an AI feature ships.
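A minimal sketch of what that record might look like, using the standard `logging` module. The field names and logger name are hypothetical, and retention is left to whatever storage sits behind the logger.

```python
import json
import logging
import uuid
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("llm.structured")  # hypothetical logger name


def log_structured_call(raw: str, parsed: Optional[dict], error: Optional[str], model: str) -> dict:
    """Emit one JSON log line holding the untouched model output next to the parsed view."""
    record = {
        "call_id": str(uuid.uuid4()),
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "raw_output": raw,          # exactly what the model returned, never cleaned up
        "parsed": parsed,           # None when parsing or validation failed
        "validation_error": error,  # None on success
    }
    logger.info(json.dumps(record))
    return record
```

Storing the raw string before any fence-stripping or trimming is what makes the record useful later; a log of the cleaned-up version hides exactly the cases you need to debug.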

When structured output is the wrong tool

If you are pulling facts out of documents and the failure mode is the model inventing values, the fix is not a tighter schema, it is better retrieval. A schema cannot stop a model from confidently filling a field with a made-up number; only grounding it in retrieved source text can. We see this constantly, and it is why reducing hallucinations in production RAG is a retrieval problem first and a prompting problem second. Structured output controls the container. Retrieval controls the truth.

For features that genuinely depend on extracting reliable structured data from messy inputs, the architecture matters as much as the prompt. A common pattern we build is retrieve-then-extract: ground the model in source text, then force a schema-valid extraction over only that text. That combination is far more reliable than either piece alone, and it is the backbone of most production RAG architectures we ship.
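One cheap way to check that an extraction really came from the retrieved text is to confirm each extracted value appears in it. The sketch below is an assumption about how such a check could look, not a description of a specific pipeline: verbatim matching is a crude proxy that will flag legitimately reformatted values, and the field names are invented.

```python
def ungrounded_fields(extracted: dict, source_text: str) -> list[str]:
    """Return the fields whose values do not appear verbatim in the retrieved text."""
    # Normalise case and whitespace so line wraps in the source do not cause misses.
    haystack = " ".join(source_text.lower().split())
    missing = []
    for field, value in extracted.items():
        if value is None:
            continue  # null means "not found", which is a grounded answer
        needle = " ".join(str(value).lower().split())
        if needle not in haystack:
            missing.append(field)
    return missing
```

A non-empty result is a signal to retry, route to review, or null the field, depending on the failure policy you chose.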

Frequently asked questions

Should I use JSON mode or function calling for structured output?

Neither, if your provider supports native structured outputs with constrained decoding. JSON mode gives you valid JSON but no schema guarantee. Function calling adds a schema hint but not a hard constraint. Native structured output enforces the schema at decode time and is the right default for production. Use function calling when you are also routing to actual tools.

Does structured output slow down my LLM calls?

Schema enforcement adds 30 to 300 tokens per call for the schema definition and a small amount of decode-time overhead, but it removes the retry loop that naive JSON parsing forces. Net latency and net cost almost always drop because you stop paying for failed calls and re-prompts.

Why does my LLM return valid JSON that is still wrong?

Schema enforcement guarantees the shape of the output, not the meaning. The model can return a perfectly structured object with hallucinated or out-of-range values. Add a deterministic validation layer in your application code to check enums, ID references, and numeric ranges, and ground extraction in retrieved source text to control correctness.

What happens when validation fails in production?

Decide the policy before you ship: retry once with the validation error fed back to the model, fall back to a safe default value, or surface a graceful failure state to the user. The rule that matters is that a parse or validation exception must never reach the customer as an unhandled error.

Getting structured outputs right is usually a few days of focused work, not a quarter, once you know which enforcement level and validation layer to use. If you have an extraction or classification feature stuck in the backlog because the output is not reliable enough to ship, that is the kind of task we take on directly. See what we build for how senior engineers ship features like this in days.

