AI feature acceptance criteria with evals

Your AI feature probably has no acceptance criteria. It has a demo that looked good in standup, a "ship it" in Slack, and a quiet hope that production looks like the demo. For deterministic features that is sloppy. For AI features it is the reason the thing breaks the week after launch.

The problem is structural. The moment a request routes through a language model, the feature's behavior becomes a distribution, not a fixed value. The classic "given X, when Y, then Z" acceptance criterion assumes one input maps to one output. An LLM can give you a different answer to the same request depending on sampling, small changes in phrasing, the retrieved context, or a silent model update. So you cannot assert your way to "done." You have to score it.

Rewrite each "then" as an eval case scored on a property, not a string match.
A feature is "done" when it clears a pass-rate threshold on a golden dataset of real inputs.
Run evals pre-merge in CI, a canary on rollout, and quality monitoring in production.
Start with 50 to 100 cases; grow to 500-plus as real failures arrive.

Why traditional acceptance criteria fail for AI features

A normal acceptance criterion is a binary assertion: click save, the record persists. You can write it once and it stays true. An AI feature has no single correct output to assert against. "Summarize this ticket" has a thousand acceptable summaries and a thousand bad ones, and the line between them is fuzzy. A string-equality test is useless, and skipping the test entirely is how teams end up shipping a feature whose quality nobody actually measured.

This is the same trap behind features that demo well and then stall, which we unpacked in "why AI pilots never reach production". The work that gets skipped is not the model integration. It is defining, in advance and in numbers, what "good enough to ship" means.

Turn each "then" into an eval case

You do not throw away your acceptance criteria. You translate them. Each "then" clause becomes a property you can score across many inputs, and the collection of those properties, with a required pass rate, becomes the executable spec for the feature.

Take an AI reply-drafting feature. The old criterion "the draft addresses the customer's question" becomes a scored property: across 100 representative tickets, does the draft reference the actual question? You measure it with an LLM-as-judge rubric, a semantic similarity score, or a human spot-check, and you set a pass rate. The feature is done when it clears, say, 90 percent on that dataset, not when it produced one good draft in a meeting.
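As a minimal sketch of that translation, assuming a hypothetical `draft_fn` that wraps the reply-drafting feature and a reviewer-labelled question per ticket, the criterion can be expressed as a pass rate over many cases. The keyword check here is only a stand-in for whichever scorer you choose (judge rubric, similarity score, or human label), and the names `EvalCase`, `references_question`, `pass_rate`, and `is_done` are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    ticket: str    # a real customer ticket from the golden dataset
    question: str  # reviewer-labelled question the draft must address


def references_question(draft: str, case: EvalCase) -> bool:
    # Placeholder scorer: passes if every substantive word of the labelled
    # question appears in the draft. Swap in a judge rubric or similarity score.
    keywords = [w for w in case.question.lower().split() if len(w) > 3]
    return bool(keywords) and all(k in draft.lower() for k in keywords)


def pass_rate(cases: List[EvalCase],
              draft_fn: Callable[[str], str],
              check: Callable[[str, EvalCase], bool]) -> float:
    # Fraction of cases whose generated draft satisfies the property.
    if not cases:
        raise ValueError("an empty eval set cannot certify anything")
    passed = sum(1 for c in cases if check(draft_fn(c.ticket), c))
    return passed / len(cases)


def is_done(cases: List[EvalCase],
            draft_fn: Callable[[str], str],
            check: Callable[[str, EvalCase], bool],
            threshold: float = 0.90) -> bool:
    # "Done" means clearing the agreed pass rate, not one good draft.
    return pass_rate(cases, draft_fn, check) >= threshold
```

The point of the shape is that the threshold is an explicit argument: the team agrees on it once, and the same function answers "is it done?" every time the prompt or model changes.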

Score properties, not exact strings

For freeform output, useful signals include a semantic similarity gate (cosine similarity above 0.8 is a common bar), overlap metrics like ROUGE above 0.7 where a reference answer exists, and rubric-based LLM-as-judge scoring for qualities like relevance and tone. Where you can demand structure, do: if downstream code expects JSON, validate the schema and treat a parse failure as an automatic fail.
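Two of those signals are easy to sketch with the standard library alone. The example below assumes the embeddings for the output and the reference come from whatever embedding model you already use (not shown), and that the expected JSON shape is a flat object described by a hypothetical `required` mapping of key to type. The 0.8 bar is the figure mentioned above, not a universal constant.

```python
import json
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Standard cosine similarity; returns 0.0 for a zero-length vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_similarity_gate(output_vec: Sequence[float],
                           reference_vec: Sequence[float],
                           threshold: float = 0.8) -> bool:
    # Gate on similarity between output and reference embeddings.
    return cosine_similarity(output_vec, reference_vec) > threshold


def passes_schema(raw_output: str, required: Dict[str, type]) -> bool:
    # A parse failure is an automatic fail, as is a missing or mistyped key.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(key in data and isinstance(data[key], expected)
               for key, expected in required.items())
```

For nested or stricter structures a dedicated schema validator is usually a better fit; the flat check is only meant to show where the automatic fail sits.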

Build a golden dataset of real inputs

A golden dataset is a reviewed, versioned set of representative inputs with expected outputs or rubrics. Start with 50 to 100 cases that cover the happy path, the obvious edge cases, and your known failure modes. Then grow it: every production miss becomes a new case. Teams that take this seriously reach 500-plus cases within the first quarter, and that dataset becomes the single most valuable asset the feature has.
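One possible shape for such a dataset, sketched under the assumption that cases live in a JSON Lines file checked into the repository: each case carries an id, the real input, a rubric, and tags for coverage, and every added production miss bumps the version. The field names and the `GoldenDataset` class are illustrative, not a standard format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class GoldenCase:
    case_id: str
    input_text: str
    rubric: str                                     # what a good output must do
    tags: List[str] = field(default_factory=list)   # e.g. "happy_path", "edge_case"
    source: str = "curated"                         # or "production_miss"


@dataclass
class GoldenDataset:
    version: int = 1
    cases: List[GoldenCase] = field(default_factory=list)

    def add_production_miss(self, case_id: str, input_text: str, rubric: str) -> None:
        # Every production miss becomes a new case and a new dataset version.
        if any(c.case_id == case_id for c in self.cases):
            raise ValueError(f"duplicate case id: {case_id}")
        self.cases.append(GoldenCase(case_id, input_text, rubric,
                                     ["known_failure"], "production_miss"))
        self.version += 1

    def to_jsonl(self) -> str:
        # First line is a header with version and count, then one case per line.
        header = json.dumps({"version": self.version, "count": len(self.cases)})
        lines = [json.dumps(asdict(c), sort_keys=True) for c in self.cases]
        return "\n".join([header] + lines)
```

Keeping the file in version control alongside the prompt means a reviewer can see, in one diff, both the change and the cases it was judged against.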

The three layers that make "done" stick

One eval run before launch is not enough, because the inputs and the model both drift. Teams that ship probabilistic features responsibly run three layers, and each catches what the one above it misses.

Layer 1: evals in CI

When an engineer changes a prompt, swaps a model version, or adjusts retrieval, the eval suite reruns the golden dataset and reports whether the feature got better or quietly worse, before the change reaches main. A common CI gate blocks the merge if mean semantic similarity falls below roughly 95 percent of the baseline or the hallucination rate rises past an agreed ceiling. This is the layer that replaces "acceptance criteria met" on your definition of done.
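A gate like that can be a small script whose exit code the CI system reads. The sketch below assumes the eval run has already produced per-case similarity scores for the baseline and the candidate, plus a hallucination rate for the candidate. The 95 percent ratio comes from the paragraph above; the 2 percent hallucination ceiling is a made-up placeholder for whatever limit your team agrees on.

```python
from statistics import mean
from typing import List, Sequence


def ci_gate(baseline_scores: Sequence[float],
            candidate_scores: Sequence[float],
            candidate_hallucination_rate: float,
            min_ratio: float = 0.95,
            max_hallucination_rate: float = 0.02) -> List[str]:
    # Returns a list of human-readable failures; an empty list means "merge allowed".
    failures = []
    baseline = mean(baseline_scores)
    candidate = mean(candidate_scores)
    if candidate < min_ratio * baseline:
        failures.append(
            f"similarity {candidate:.3f} is below {min_ratio:.0%} of baseline {baseline:.3f}"
        )
    if candidate_hallucination_rate > max_hallucination_rate:
        failures.append(
            f"hallucination rate {candidate_hallucination_rate:.1%} exceeds "
            f"{max_hallucination_rate:.1%}"
        )
    return failures


def exit_code(failures: List[str]) -> int:
    # CI systems treat a non-zero exit code as a failed check.
    for failure in failures:
        print(f"EVAL GATE FAILED: {failure}")
    return 1 if failures else 0
```

In a pipeline the last step would call `sys.exit(exit_code(ci_gate(...)))`, so a regression blocks the merge the same way a failing unit test does.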

Layer 2: canary on rollout

CI proves the change is good on your dataset. A canary proves it on real users. Expose the new version to a slice of traffic and watch the same quality metrics live, with an automatic rollback if a rolling-mean score drops 2 to 3 points and holds. We cover the mechanics of this in "how to ship an AI feature without breaking production".
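The "drops and holds" rule can be sketched as a small monitor. This version assumes quality scores on a 0 to 100 scale, a known baseline from the current production version, and a hypothetical `hold` parameter meaning how many consecutive windows the drop must persist before rollback. The window size and hold count are illustrative and would need tuning to your traffic.

```python
from collections import deque


class CanaryMonitor:
    def __init__(self, baseline: float, window: int = 50,
                 max_drop: float = 2.0, hold: int = 3) -> None:
        self.baseline = baseline             # rolling-mean score of the current version
        self.max_drop = max_drop             # points below baseline that count as a breach
        self.hold = hold                     # consecutive breaches required to roll back
        self.scores = deque(maxlen=window)   # most recent canary scores
        self.breaches = 0

    def observe(self, score: float) -> bool:
        # Record one scored canary response; return True when rollback should trigger.
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data for a stable rolling mean yet
        rolling_mean = sum(self.scores) / len(self.scores)
        if self.baseline - rolling_mean >= self.max_drop:
            self.breaches += 1
        else:
            self.breaches = 0  # the drop did not hold, so reset
        return self.breaches >= self.hold
```

Requiring the drop to persist is what keeps a single bad response from rolling back a healthy release.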

Layer 3: production monitoring

Once the feature is live for everyone, keep scoring a sample of real traffic. This is what catches a silent provider model update or a slow drift in user inputs that your fixed golden dataset never saw. The production misses you find here flow straight back into Layer 1 as new eval cases.
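A rough sketch of that loop: sample a deterministic fraction of requests by hashing the request id, score only the sampled ones, and emit any low scorers in a form ready to be reviewed and added to the golden dataset. The `score_fn` callable, the 5 percent sample rate, and the 0.8 floor are assumptions for illustration.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Tuple


def in_sample(request_id: str, sample_rate: float) -> bool:
    # Hash-based sampling: the same request id always gets the same decision.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")       # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def collect_misses(traffic: Iterable[Tuple[str, str, str]],
                   score_fn: Callable[[str, str], float],
                   sample_rate: float = 0.05,
                   min_score: float = 0.8) -> List[Dict[str, str]]:
    # traffic yields (request_id, user_input, model_output) tuples.
    misses = []
    for request_id, user_input, output in traffic:
        if not in_sample(request_id, sample_rate):
            continue
        if score_fn(user_input, output) < min_score:
            misses.append({
                "case_id": f"prod-{request_id}",
                "input": user_input,
                "observed_output": output,
            })
    return misses
```

Each returned miss still needs a human to write the rubric or expected output before it joins the CI suite; the code only finds candidates.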

A definition of done you can actually enforce

Rebuild the definition of done for any AI feature around numbers a reviewer can check:

An eval suite of at least 50 cases exists and is versioned.
The feature clears the agreed pass-rate threshold on that suite.
The eval runs in CI and blocks merges on regression.
Cost-per-1,000-uses is known. If you cannot answer it, the feature is not done.
Uncertainty is handled: if the model can take an action, it has guardrails or a human gate.

That last point is the one teams underweight. If a human is always the final approver, you can tolerate more model uncertainty. If the feature can send an email or change a record on its own, you design it with the care of a payments system. For the retrieval-heavy version of all this, the same discipline applies to RAG, which we detail in "the production RAG architecture guide".
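The checklist above is concrete enough to check mechanically. The sketch below is one way to do it, with the cost item reduced to arithmetic over average token counts and per-million-token prices. All prices and token counts in any call are placeholders you would replace with your provider's actual figures, and `DoneReport` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import List, Optional


def cost_per_1000_uses(avg_input_tokens: float, avg_output_tokens: float,
                       input_price_per_million: float,
                       output_price_per_million: float,
                       calls_per_use: float = 1.0) -> float:
    # Cost of one model call, scaled by calls per feature use, times 1,000 uses.
    per_call = (avg_input_tokens * input_price_per_million
                + avg_output_tokens * output_price_per_million) / 1_000_000
    return per_call * calls_per_use * 1000


@dataclass
class DoneReport:
    eval_cases: int
    pass_rate: float
    threshold: float
    ci_gate_enabled: bool
    cost_per_1000: Optional[float]        # None means nobody has measured it
    can_act_autonomously: bool
    has_guardrails_or_human_gate: bool

    def unmet(self) -> List[str]:
        # Returns the checklist items that are not yet satisfied.
        problems = []
        if self.eval_cases < 50:
            problems.append("eval suite has fewer than 50 cases")
        if self.pass_rate < self.threshold:
            problems.append("pass rate is below the agreed threshold")
        if not self.ci_gate_enabled:
            problems.append("evals do not block merges in CI")
        if self.cost_per_1000 is None:
            problems.append("cost per 1,000 uses is unknown")
        if self.can_act_autonomously and not self.has_guardrails_or_human_gate:
            problems.append("autonomous actions lack guardrails or a human gate")
        return problems
```

For example, a feature averaging 2,000 input and 500 output tokens per call, at hypothetical prices of 3 and 15 per million tokens, comes to about 13.50 per 1,000 uses with one call per use.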

FAQ

How many eval cases do I need before I ship?

Start with 50 to 100 covering the happy path, edge cases, and known failure modes. That is enough to set a meaningful pass rate. Then grow the set continuously from real production misses, aiming for several hundred cases within the first quarter. The dataset is never "finished"; it tracks the feature.

What pass rate should count as done?

There is no universal number; it depends on the cost of a wrong answer. A low-stakes suggestion can ship at a lower bar than a feature that takes an irreversible action. Set the threshold with the team before launch, write it down, and gate CI on it. The discipline of agreeing the number up front matters more than the exact figure.

Who writes the evals, engineering or product?

Both. Product owns what "good" means and supplies real inputs and judgments. Engineering turns those into a scored, automated suite that runs in CI. Treating evals as the shared acceptance criteria is what keeps the two roles pointed at the same definition of done.

Is this overkill for a small AI feature?

A 50-case eval suite and a CI gate are a few days of work, not a research project, and they pay for themselves the first time a model update would have silently degraded quality. If you want a senior team to set up the eval harness alongside the feature, that is exactly the kind of scoped task our subscription ships.
