The customer
A vertical-SaaS HR-tech platform with three AI features in production. Releases were "vibes-based" — engineers eyeballed a few examples before merging. Two regressions had reached customers in the prior quarter.
The task they submitted
Stand up an eval pipeline that runs on every PR. We need to be able to merge without praying.
Our approach
We curated 640 eval cases from real product traffic, designed per-feature graders (a mix of LLM-as-judge and deterministic checks), wired the suite into GitHub Actions, gated merges on a configurable score threshold, and surfaced diff dashboards in PR comments.
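The core gating mechanism is simple: run the eval suite, compute a pass rate, and fail the CI job if it falls below the threshold. A minimal sketch, assuming eval cases are (input, expected) pairs scored by a deterministic exact-match grader; the names `run_feature`, `exact_match_grader`, and `SCORE_THRESHOLD` are hypothetical, not the platform's actual API:

```python
# Minimal sketch of a CI merge gate. All names and the threshold value are
# illustrative assumptions, not the customer's actual code.
import sys

SCORE_THRESHOLD = 0.95  # configurable gate; assumed value


def run_feature(text: str) -> str:
    # Stand-in for the AI feature under test.
    return text.strip().lower()


def exact_match_grader(output: str, expected: str) -> bool:
    # Deterministic check: pass iff the output matches exactly.
    return output == expected


# In practice these would be sampled from real product traffic.
EVAL_CASES = [
    ("  Alice  ", "alice"),
    ("BOB", "bob"),
    ("carol", "carol"),
]


def gate() -> float:
    passed = sum(
        exact_match_grader(run_feature(inp), exp) for inp, exp in EVAL_CASES
    )
    score = passed / len(EVAL_CASES)
    print(f"eval score: {score:.2f} (threshold {SCORE_THRESHOLD})")
    return score


if __name__ == "__main__":
    # A non-zero exit code fails the CI job, blocking the merge.
    sys.exit(0 if gate() >= SCORE_THRESHOLD else 1)
```

In a GitHub Actions workflow, this script would run as a step on every PR; its exit code is what actually blocks the merge when branch protection requires the check to pass.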
The outcome
Zero regressions in the 90 days since shipping. Two real bugs caught at PR time before merge. Engineers report 3× higher confidence merging AI changes.
“Eval-in-CI changed how we ship AI features. We're 3× more confident merging, and we're catching real issues.”
