LLM eval harness: catch regressions in CI

Provider model updates ship more often than your own releases and quietly shift behavior, so a feature that passed last month can degrade with no code change.
A small golden dataset of 30 to 50 reviewed cases plus three metrics, run on every pull request, catches more regressions than a large framework nobody maintains.
Keep the suite under five minutes so it runs in CI like any other test, and gate merges on it.
Mix deterministic checks (exact match, schema, must-contain) with a small number of LLM-as-judge checks for the subjective parts.

The scariest class of AI bug is the one that arrives with no commit attached. Your prompt did not change, your code did not change, and yet last week's behavior is gone. The cause is usually upstream: the provider pushed a new model snapshot, deprecated the one you pinned, or silently re-tuned the endpoint behind your API call. Frontier models now ship on a faster cadence than most internal release cycles, and each update can move tone, formatting, refusal behavior, and tool-call structure.

You cannot stop providers from shipping. What you can do is build a tripwire that tells you the moment behavior drifts, before your users find out. That tripwire is a small, fast eval suite wired into CI. Here is how we build one.

The silent regression problem

Most teams test LLM features the way they tested the demo: a human types a few prompts, eyeballs the answers, and ships. That works exactly once. It gives you no signal when a model update changes the distribution of outputs, because there is nothing to compare against. By the time a support ticket arrives, the regression has been live for days.

The fix is the same one that solved this problem for ordinary software decades ago: a fixed dataset, fixed checks, and fixed thresholds that run automatically. The only twist is that LLM outputs are not deterministic, so some checks have to tolerate variation. That is a solvable problem, not a reason to skip the suite. This is also where the distinction between evals and live monitoring matters, which we unpack in AI agent observability versus evals.

Build a golden dataset that catches real failures

Start at 30 to 50 cases, not 500

The instinct is to build a comprehensive dataset before you trust it. Resist it. A suite of 30 to 50 reviewed examples that runs in under five minutes will catch regressions every week. A 500-case framework that takes 40 minutes and breaks the build with flaky checks gets disabled within a month. Start small, run it on every pull request, and grow the set every time a real failure slips through.

Source cases from production, not your imagination

The best examples are the ones your users actually send. Pull representative sessions from your logs across the segments and difficulty levels you serve, strip anything sensitive, and freeze them as inputs with reviewed expected outputs. Synthetic cases you invent at your desk tend to test the happy path the model already handles. Real traffic surfaces the messy inputs that break things.

Cover four kinds of case

A golden set earns its keep when it spans more than the happy path. Include clearly correct inputs with a known good answer, recoverable edge cases where the model should ask for clarification or degrade gracefully, unrecoverable inputs where the right move is to refuse or hand off, and adversarial inputs that try to break the system. For agents and tool-using features, that adversarial slice should include the injection patterns we describe in defending production AI agents against prompt injection.
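One way to keep those four kinds of case honest is to tag each record with its category and check coverage whenever the dataset changes. The sketch below is a minimal illustration: the `GoldenCase` fields, the category names, and the sample cases are all assumptions made up for this example, not a required schema.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels for the four kinds of case described above.
CATEGORIES = ("correct", "recoverable", "unrecoverable", "adversarial")


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    category: str    # one of CATEGORIES
    input_text: str  # frozen, sanitized input taken from production logs
    expected: str    # reviewed expected output or behavior label


def missing_categories(cases):
    """Return the categories that have no case in the dataset."""
    counts = Counter(case.category for case in cases)
    return [name for name in CATEGORIES if counts[name] == 0]


# Made-up sample cases, for illustration only.
cases = [
    GoldenCase("c1", "correct", "Reset my password", "password_reset"),
    GoldenCase("c2", "recoverable", "It is broken", "ask_clarification"),
    GoldenCase("c3", "unrecoverable", "Give me legal advice", "hand_off"),
]

print(missing_categories(cases))  # ['adversarial']
```

A check like this can run as its own test, so a pull request that deletes the last adversarial case fails before the evals even start.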

Wire it into CI in under five minutes

The whole point is that the suite runs like any other test. Put the golden dataset in your repo, write a runner that calls your feature for each case and scores the output, and add it as a required check on every pull request. Promptfoo, DeepEval, or a hand-rolled script all work; the runner matters less than the discipline of gating merges on it.
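A hand-rolled runner can be very short. The sketch below assumes a classification-style feature scored by exact match; `fake_feature`, the `GOLDEN` cases, and the 0.9 threshold are placeholders invented for illustration, and in practice the dataset would be loaded from a file in the repo.

```python
def fake_feature(text: str) -> str:
    """Stand-in for the real LLM-backed feature; swap in your own call."""
    return "refund" if "refund" in text.lower() else "other"


# Made-up cases for illustration only.
GOLDEN = [
    {"input": "I want a refund for my order", "expected": "refund"},
    {"input": "Where is my parcel?", "expected": "other"},
    {"input": "Refund please", "expected": "refund"},
]


def run_suite(feature, cases):
    """Call the feature on every case and return the fraction that pass."""
    passed = sum(1 for case in cases if feature(case["input"]) == case["expected"])
    return passed / len(cases)


def main(threshold: float = 0.9) -> int:
    """Return a process exit code: 0 when the suite clears the threshold."""
    rate = run_suite(fake_feature, GOLDEN)
    print(f"pass rate: {rate:.2f}")
    return 0 if rate >= threshold else 1


exit_code = main()  # in CI, pass this value to sys.exit() to gate the merge
```

The non-zero exit code is what makes the CI check fail, which is all a required status check needs.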

Three rules keep it healthy:

Set a pass threshold per metric, not per case. Expect a few cases to wobble; fail the build when the aggregate score drops below the bar you set, not when any single output changes wording.
Run it on a schedule too, not only on pull requests. A nightly run against the live model endpoint is what actually catches provider-side updates, since those land with no commit on your side.
Pin your model version explicitly and treat a version bump as a code change that must clear the suite. When you do change models, the same evals tell you whether the cheaper or faster option is safe, which is the backbone of the approach in model routing to cut AI costs.
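The first rule, thresholds per metric rather than per case, can be sketched as a small gate function. The metric names and bars below are hypothetical; pick values that match your own baseline.

```python
# Hypothetical per-metric bars; tune these to your own baseline.
THRESHOLDS = {
    "exact_match": 0.95,
    "schema_valid": 1.0,
    "judge_faithfulness": 0.85,
}


def failing_metrics(results, thresholds):
    """Return {metric: pass_rate} for every metric below its bar.

    `results` maps a metric name to a list of per-case booleans.
    """
    failures = {}
    for metric, bar in thresholds.items():
        outcomes = results[metric]
        rate = sum(outcomes) / len(outcomes)
        if rate < bar:
            failures[metric] = rate
    return failures


# Made-up results: one exact-match miss is tolerated, four judge misses are not.
results = {
    "exact_match": [True] * 19 + [False],
    "schema_valid": [True] * 20,
    "judge_faithfulness": [True] * 16 + [False] * 4,
}

print(failing_metrics(results, THRESHOLDS))  # {'judge_faithfulness': 0.8}
```

The build fails only when the returned dictionary is non-empty, so a single reworded answer does not block a merge while a real drop in one metric does.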

What to measure

Lean on deterministic checks wherever the task allows, because they are fast, free, and never flaky. Use exact match for classification labels, JSON schema validation for structured output, must-contain or must-not-contain assertions for required facts and banned content, and numeric tolerance for extracted values. These cover most of a typical feature.
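Each of these checks fits in a few lines of standard-library Python. The helper names below are made up for illustration, and `has_required_fields` is a deliberately simplified stand-in for full JSON Schema validation, which a dedicated validator library would handle more thoroughly.

```python
import json
import math


def exact_match(output: str, expected: str) -> bool:
    """Compare labels, ignoring case and surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def has_required_fields(output: str, required: dict) -> bool:
    """Simplified schema check: output parses as a JSON object and each
    required key is present with the expected Python type."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())


def must_contain(output: str, required=(), banned=()) -> bool:
    """All required phrases are present and no banned phrase appears."""
    text = output.lower()
    has_all = all(phrase.lower() in text for phrase in required)
    has_banned = any(phrase.lower() in text for phrase in banned)
    return has_all and not has_banned


def within_tolerance(value: float, expected: float, rel_tol: float = 0.01) -> bool:
    """Numeric check for extracted values, within a relative tolerance."""
    return math.isclose(value, expected, rel_tol=rel_tol)
```

For example, `has_required_fields('{"label": "refund", "amount": 12.5}', {"label": str, "amount": float})` passes, while a reply that wraps the JSON in prose fails at the parse step.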

Reserve LLM-as-judge for the genuinely subjective parts: faithfulness of a summary to its source, helpfulness of an answer, tone. Keep the judge prompts versioned and audited, because a drifting judge gives you false confidence. The common evaluation mistakes that make a judge lie to you are the same ones we catalog in the eval mistakes that let bad RAG ship, and the practice of writing acceptance criteria as runnable checks is covered in turning AI feature acceptance criteria into evals.
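One way to keep a judge prompt versioned and auditable is to store it as a constant and record its version and content hash with every verdict. This is a sketch under assumptions: the prompt wording, the version label, and the `call_model` callable (any function that takes a prompt string and returns the model's reply) are hypothetical and not tied to a specific provider SDK.

```python
import hashlib

JUDGE_PROMPT_VERSION = "faithfulness-v1"  # hypothetical version label
JUDGE_PROMPT = (
    "You are grading a summary against its source.\n"
    "Answer PASS if every claim in the summary is supported by the source, "
    "otherwise answer FAIL.\n\n"
    "Source:\n{source}\n\nSummary:\n{summary}\n"
)


def prompt_fingerprint(prompt: str) -> str:
    """Short content hash, so an unversioned prompt edit is still detectable."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def judge_faithfulness(source: str, summary: str, call_model) -> dict:
    """Ask the judge for a verdict and record which prompt produced it."""
    reply = call_model(JUDGE_PROMPT.format(source=source, summary=summary))
    return {
        "passed": reply.strip().upper().startswith("PASS"),
        "judge_version": JUDGE_PROMPT_VERSION,
        "judge_fingerprint": prompt_fingerprint(JUDGE_PROMPT),
    }


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call, used here so the sketch runs offline."""
    return "PASS"


result = judge_faithfulness("The cat sat on the mat.", "A cat sat.", stub_model)
print(result["passed"], result["judge_version"])  # True faithfulness-v1
```

Storing the fingerprint next to each score makes it possible to tell a genuine regression apart from a change in the judge itself when two runs are compared.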

Measured this way, a model update that breaks your feature shows up as a red build, not a customer complaint. That is the entire return on a few hours of setup.

Frequently asked questions

How many eval cases do I actually need to start?

Thirty to fifty reviewed cases is enough to start catching regressions, as long as the suite runs in under five minutes and on every pull request. Grow the set whenever a real production failure gets past it. A small suite that runs constantly beats a large one that runs never.

How do I handle non-deterministic LLM outputs in CI?

Score on aggregate thresholds rather than exact output matching, use deterministic checks (schema, must-contain, numeric tolerance) wherever possible, and reserve LLM-as-judge for subjective criteria. Set the temperature low for the eval run if your feature allows it, and fail the build on a drop in the overall pass rate, not on any single reworded answer.

Will this catch provider model updates?

Only if you run the suite on a schedule against the live endpoint, not just on pull requests. Provider updates land with no commit on your side, so a nightly or weekly scheduled run is what surfaces them. Pin your model version and treat any bump as a change that must clear the evals first.

Do I need a dedicated eval platform?

No. A versioned dataset in your repo, a short runner script, and your existing CI are enough to start. Tools like Promptfoo or DeepEval reduce boilerplate, but the value comes from the golden dataset and the merge gate, not the platform.
