LLM-as-a-judge: calibrate before you trust the score

An LLM-as-a-judge is itself an LLM, so it carries the same failure modes as the feature you are testing: it hallucinates, it drifts, and it is biased by length, position, and its own style. If you gate releases on an uncalibrated judge, it will hand you confident, wrong scores, and you will ship regressions while the eval dashboard stays green. Before a judge decides what reaches production, measure how often it agrees with your own engineers on a small labeled set. If the judge disagrees with your team far more often than your team disagrees with itself, it is not a gate; it is a coin flip.

Why teams reach for an LLM judge in the first place

Human grading does not scale. A support copilot that handles ten thousand conversations a week cannot be hand-graded on every release, and a RAG answer engine produces outputs that no string match can score. So teams hand the scoring to another model: a prompt that reads the feature output, compares it to a rubric or a reference answer, and returns a pass, a fail, or a number.

This is the right instinct. It is the only way to score open-ended generation at the volume a real product needs, and it is what makes eval-driven development possible at all. The problem is not the technique. The problem is that most teams wire the judge straight into the release pipeline and trust its number on day one, the same way they would trust an assertion in a unit test. A judge is not an assertion. It is a probabilistic system that has to earn your trust before it can hold a gate.

The failure modes that make an uncalibrated judge dangerous

An LLM judge fails in ways a regex never could, and most of them are invisible until you look for them.

Self-preference and sycophancy

A judge tends to score outputs that match its own style and phrasing higher than outputs that do not, even when the substance is identical. If your feature and your judge use the same base model family, the judge quietly rewards answers that sound like the model rather than answers that are correct. It also tends to agree with whatever framing is in the prompt, so a leading rubric like "confirm the answer is helpful" pulls scores up across the board.

Length and position bias

Longer answers read as more thorough to a judge, so a verbose, padded response often beats a tight, correct one. In pairwise comparison the order matters too: the answer shown first wins more often than chance, regardless of quality. A judge that ranks a padded reply above a tight, correct one is measuring word count, not accuracy.
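One rough way to check for length bias on a labeled set is to look only at answers your humans failed and compare how long the ones the judge wrongly passed are against the ones it caught. The sketch below assumes a hypothetical record shape of (answer text, human verdict, judge verdict) and uses word count as a crude proxy for length; the sample data is made up.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple

# Hypothetical record shape: (answer text, human says pass, judge says pass)
Record = Tuple[str, bool, bool]


def false_pass_length_gap(records: Sequence[Record]) -> Optional[float]:
    """Mean word count of wrongly passed answers minus mean of caught fails.

    Only answers the humans failed are considered. A large positive gap
    suggests the judge is letting long answers through. Returns None when
    either group is empty, because the comparison is then meaningless.
    """
    false_passes = [
        len(text.split()) for text, human, judge in records if not human and judge
    ]
    caught_fails = [
        len(text.split()) for text, human, judge in records if not human and not judge
    ]
    if not false_passes or not caught_fails:
        return None
    return mean(false_passes) - mean(caught_fails)


# Made-up data: one long wrong answer the judge passed, two short ones it caught.
records = [
    ("word " * 120, False, True),
    ("word " * 20, False, False),
    ("word " * 40, False, False),
    ("word " * 30, True, True),
]
gap = false_pass_length_gap(records)
print(gap)
```

A gap near zero does not prove the judge is unbiased, but a large one is a cheap signal that the failed answers deserve a closer read.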

Rubric drift

A vague rubric invites the judge to invent its own standard, and that standard moves between runs. "Rate the answer from 1 to 5 on quality" produces a 4 today and a 3 tomorrow for the same text, because nothing in the prompt pins what a 4 means. The score looks precise and is mostly noise.
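Drift of this kind can be measured by scoring the same text several times and counting how many outputs receive more than one distinct score. The following is a minimal sketch; `flaky_judge` and `steady_judge` are stand-ins invented for illustration, and a real judge would call a model instead.

```python
from itertools import count
from typing import Callable, Sequence


def unstable_fraction(
    judge: Callable[[str], int], outputs: Sequence[str], runs: int = 5
) -> float:
    """Share of outputs whose score changes across repeated judge calls."""
    if not outputs:
        raise ValueError("need at least one output to score")
    unstable = 0
    for text in outputs:
        scores = {judge(text) for _ in range(runs)}
        if len(scores) > 1:
            unstable += 1
    return unstable / len(outputs)


# Stand-in for a judge with a vague 1-to-5 rubric: it alternates between
# 4 and 3 on every call, regardless of the text.
_calls = count()


def flaky_judge(text: str) -> int:
    return 4 if next(_calls) % 2 == 0 else 3


# Stand-in for a judge with a concrete disqualifier: same input, same score.
def steady_judge(text: str) -> int:
    return 1 if "wrong refund window" in text else 5


answers = ["Refunds are covered in the billing docs.", "This quotes the wrong refund window."]
drift = unstable_fraction(flaky_judge, answers)
stable = unstable_fraction(steady_judge, answers)
print(drift, stable)
```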

The judge breaks when the provider updates it

If you call a hosted model as your judge and the provider ships a new version, your scoring function changes with no commit on your side. A gate that passed last week can fail this week, or worse, pass a regression it used to catch. This is the same silent-shift problem we covered in catching LLM regressions in CI before a model update breaks prod, except now it is your measuring instrument that moved, not the feature.

Calibrate the judge before you trust it

Calibration is the step almost everyone skips. It is not complicated, and it is the difference between a gate and a guess.

Build a small human-labeled set

Pull 50 to 100 real outputs from the feature you want to gate. Have two of your own engineers, ideally the people who own the feature, label each one independently against the same pass or fail criteria you will give the judge. Disagreements between your two humans are not a bug to hide. They are your noise floor, and you need that number.
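The noise floor is simply the share of examples on which the two labelers gave the same verdict. A minimal sketch, assuming a hypothetical `LabeledExample` record with one boolean per labeler (True meaning pass) and made-up conversation IDs:

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class LabeledExample:
    example_id: str
    labeler_a: bool  # True = pass, False = fail
    labeler_b: bool


def human_agreement(examples: Sequence[LabeledExample]) -> float:
    """Fraction of examples where both labelers gave the same verdict."""
    if not examples:
        raise ValueError("need at least one labeled example")
    agreed = sum(1 for ex in examples if ex.labeler_a == ex.labeler_b)
    return agreed / len(examples)


def disputed_ids(examples: Sequence[LabeledExample]) -> List[str]:
    """IDs the two labelers should discuss before the judge is scored."""
    return [ex.example_id for ex in examples if ex.labeler_a != ex.labeler_b]


sample = [
    LabeledExample("conv-001", True, True),
    LabeledExample("conv-002", False, False),
    LabeledExample("conv-003", True, False),
    LabeledExample("conv-004", True, True),
]
noise_floor = human_agreement(sample)
print(noise_floor, disputed_ids(sample))
```

Keeping the disputed IDs around is useful: those are often the examples where the pass or fail criteria themselves need tightening.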

Measure agreement, not vibes

Now run the judge over the same set and compare its labels to the human labels. The metric you want is agreement rate, and ideally a chance-corrected one like Cohen's kappa so that a judge that always says "pass" on a mostly-passing set does not look good for free. Report it as a number, not a feeling. On a set where 70 percent of the outputs pass, a judge at 71 percent raw agreement is barely ahead of always answering "pass", and it has no business holding a gate.
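To make the chance correction concrete, here is a small sketch of raw agreement and Cohen's kappa for binary labels. It assumes the human side is a single consensus label per example (for instance, the label the two engineers settle on after discussing disagreements); the label lists are made up.

```python
import math
from typing import Sequence


def raw_agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of positions where the two label lists match."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa: (p_observed - p_chance) / (1 - p_chance)."""
    p_observed = raw_agreement(a, b)
    n = len(a)
    # Chance agreement: probability both raters pick the same label if each
    # labels at random with their own observed pass and fail rates.
    p_chance = sum(
        (list(a).count(label) / n) * (list(b).count(label) / n)
        for label in (True, False)
    )
    if p_chance == 1.0:
        return math.nan  # both raters used a single identical label; undefined
    return (p_observed - p_chance) / (1 - p_chance)


human = [True] * 8 + [False] * 2           # 8 pass, 2 fail
always_pass = [True] * 10                  # a judge that never fails anything
careful = [True] * 7 + [False] * 3         # differs from human on one example

print(raw_agreement(human, always_pass), cohens_kappa(human, always_pass))
print(raw_agreement(human, careful), cohens_kappa(human, careful))
```

The always-pass judge reaches 80 percent raw agreement on this set yet scores a kappa of zero, which is the "looks good for free" effect the chance correction removes.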

Set the bar against your humans, not against 100 percent

Here is the rule that keeps you honest: the judge only earns the gate if it agrees with your engineers about as often as your engineers agree with each other. If your two labelers agree 90 percent of the time and the judge agrees with the consensus 88 percent of the time, the judge is roughly as reliable as a careful human reviewer, and you can trust it. If your humans agree 90 percent and the judge lands at 70 percent, the judge is measuring something other than what you care about, and no threshold tuning will fix that. This reframes acceptance criteria for AI features around a pass rate you can actually defend.
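The rule can be written down as a single comparison. In this sketch the `tolerance` of three points is an assumption chosen for illustration, not a number from this article; pick your own based on how much slack below the human floor you are willing to accept.

```python
def judge_earns_gate(
    human_human: float, judge_human: float, tolerance: float = 0.03
) -> bool:
    """True if judge-human agreement is within `tolerance` of the human floor.

    Both inputs are agreement rates between 0 and 1. The default tolerance
    is a hypothetical value, not a recommendation.
    """
    for rate in (human_human, judge_human):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("agreement rates must be between 0 and 1")
    return judge_human >= human_human - tolerance


print(judge_earns_gate(0.90, 0.88))  # close to the human floor
print(judge_earns_gate(0.90, 0.70))  # far below it
```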

A worked example: gating a B2B support copilot

Take a billing-support copilot inside a SaaS product. The team wants to gate every release on "does the answer resolve the customer's billing question correctly," and they stand up an LLM judge to score it. On the first run the judge reports a 94 percent pass rate, and everyone relaxes.

Then they build a 60-example calibration set. Two engineers label each conversation pass or fail; they agree on 55 of 60, a 92 percent human-human agreement. They run the judge on the same 60. The judge agrees with the human consensus on only 41 of them, about 68 percent. Digging in, the judge was passing answers that confidently quoted the wrong refund window, because those answers were long, well-formatted, and sounded authoritative. The 94 percent dashboard was fiction.

The fix was not a better model. It was a tighter rubric ("the answer is a fail if any stated policy, date, or dollar amount contradicts the billing docs"), a reference answer pulled from those docs for the judge to compare against, and a switch to a binary verdict. After the rewrite the judge hit 89 percent agreement with the humans, close enough to the 92 percent human floor to gate on. The same calibration set then became the regression suite the team runs on every release.

Design the judge for reliability

Narrow the rubric and shrink the scale

Binary pass or fail is far more reliable than a 1 to 5 scale, because there is less room for the judge to drift. If you need gradations, use three clearly defined levels with explicit anchors for each, not a vague five. Spell out exactly what makes something fail, with concrete disqualifiers like a wrong number or a contradicted policy.

Prefer pairwise over absolute when you can

Judges are better at "is A better than B" than at "score A from 1 to 10." If your goal is to confirm a new prompt did not regress against the current one, ask the judge to compare the two outputs directly, and swap the order on half the comparisons to cancel position bias.
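A sketch of the order swap, assuming a hypothetical judge callable that takes the two answers in display order and returns "first" or "second". Half of the comparisons show the candidate first, so a judge that simply favors the first slot ends up at a 50 percent win rate instead of inflating or deflating the result.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: (answer shown first, answer shown second)
# -> "first" or "second".
PairwiseJudge = Callable[[str, str], str]


def candidate_win_rate(
    pairs: Sequence[Tuple[str, str]], judge: PairwiseJudge, seed: int = 0
) -> float:
    """Win rate of the candidate output over the current one.

    Each pair is (current_output, candidate_output). The candidate is shown
    first on half of the pairs, chosen by a seeded shuffle.
    """
    if not pairs:
        raise ValueError("need at least one pair to compare")
    candidate_first = [i % 2 == 0 for i in range(len(pairs))]
    random.Random(seed).shuffle(candidate_first)
    wins = 0
    for (current, candidate), show_candidate_first in zip(pairs, candidate_first):
        if show_candidate_first:
            wins += judge(candidate, current) == "first"
        else:
            wins += judge(current, candidate) == "second"
    return wins / len(pairs)


# Stand-in judges for illustration only.
def position_biased(first: str, second: str) -> str:
    return "first"


def prefers_longer(first: str, second: str) -> str:
    return "first" if len(first) > len(second) else "second"


pairs = [("short answer", "a noticeably longer candidate answer")] * 4
print(candidate_win_rate(pairs, position_biased))
print(candidate_win_rate(pairs, prefers_longer))
```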

Return a structured verdict

Have the judge return a structured object with the verdict, a short reason, and the specific rule it triggered, not a free-text paragraph you parse later. This makes failures auditable and stops the judge from burying a fail inside prose. The same discipline we describe in getting reliable JSON out of an LLM applies to the judge itself.
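One possible shape for that object, with validation on the way in. The field names `verdict`, `reason`, and `rule`, and the rule identifier in the example, are assumptions made for this sketch; the point is that anything other than a well-formed verdict is rejected rather than guessed at.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_VERDICTS = {"pass", "fail"}


@dataclass(frozen=True)
class Verdict:
    verdict: str          # "pass" or "fail"
    reason: str           # short explanation from the judge
    rule: Optional[str]   # rule that triggered a fail; may be None on a pass


def parse_verdict(raw: str) -> Verdict:
    """Parse and validate the judge's JSON reply; raise ValueError if malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    verdict = data.get("verdict")
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    reason = data.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("judge reply must include a non-empty reason")
    rule = data.get("rule")
    if verdict == "fail" and not rule:
        raise ValueError("a fail must name the rule it triggered")
    return Verdict(verdict=verdict, reason=reason, rule=rule)


reply = (
    '{"verdict": "fail", "reason": "Stated refund window contradicts the docs.", '
    '"rule": "contradicts_billing_docs"}'
)
parsed = parse_verdict(reply)
print(parsed)
```

Treating a malformed reply as an error, rather than as a pass or a fail, keeps parsing problems from silently moving the pass rate.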

Pin the model and re-calibrate on a schedule

Pin the exact judge model and version so your scoring function does not move under you, and re-run the calibration set on a schedule, after any judge upgrade, and whenever the feature's behavior shifts. Calibration is not a gate you pass once. It is a number you watch, the same way you watch the feature it grades.
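A sketch of what "watching the number" could look like: store the pinned model identifier next to the agreement it achieved, and flag the judge for review when either changes. The model identifiers below are placeholders, and the `max_drop` of three points is an assumed threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationRun:
    model_id: str      # exact pinned judge version (placeholder strings below)
    agreement: float   # judge-human agreement on the calibration set


def needs_review(
    baseline: CalibrationRun, latest: CalibrationRun, max_drop: float = 0.03
) -> bool:
    """Flag the judge when its version changed or agreement dropped too far."""
    if latest.model_id != baseline.model_id:
        return True  # any judge upgrade triggers a fresh look
    return baseline.agreement - latest.agreement > max_drop


baseline = CalibrationRun("judge-model-pinned-v1", 0.89)
same_model_ok = CalibrationRun("judge-model-pinned-v1", 0.88)
same_model_drop = CalibrationRun("judge-model-pinned-v1", 0.80)
upgraded = CalibrationRun("judge-model-pinned-v2", 0.91)

print(needs_review(baseline, same_model_ok))
print(needs_review(baseline, same_model_drop))
print(needs_review(baseline, upgraded))
```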

Where the judge fits in the release pipeline

A calibrated judge is one half of a release gate. The other half is running it automatically: a golden set of inputs, the judge scoring every output, and a build that fails when the pass rate drops below the bar you set during calibration. That is eval-gated CI, and it only means anything if the instrument doing the scoring has been checked against humans first.
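The build-failing step itself is small. A minimal sketch, assuming the judge's verdicts over the golden set are already collected as booleans and that the bar (0.90 here, an illustrative value) came out of calibration:

```python
from typing import Sequence, Tuple


def evaluate_gate(
    judge_passes: Sequence[bool], min_pass_rate: float
) -> Tuple[bool, float]:
    """Return (gate_ok, pass_rate) for one run over the golden set."""
    if not judge_passes:
        raise ValueError("golden set produced no verdicts")
    pass_rate = sum(judge_passes) / len(judge_passes)
    return pass_rate >= min_pass_rate, pass_rate


def exit_code(judge_passes: Sequence[bool], min_pass_rate: float = 0.90) -> int:
    """0 when the gate holds, 1 when the build should fail."""
    ok, pass_rate = evaluate_gate(judge_passes, min_pass_rate)
    status = "PASS" if ok else "FAIL"
    print(f"eval gate {status}: pass rate {pass_rate:.1%} (bar {min_pass_rate:.1%})")
    return 0 if ok else 1


# Made-up verdicts for two releases over a 20-example golden set.
release_a = [True] * 19 + [False]        # 19 of 20 pass
release_b = [True] * 17 + [False] * 3    # 17 of 20 pass

code_a = exit_code(release_a)
code_b = exit_code(release_b)
```

In a CI job the return value would typically be handed to `sys.exit`, so that a non-zero code fails the build.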

It is also worth being clear about what a judge does not give you. A green eval gate tells you the feature passed your golden set; it does not tell you what is happening in production, which is why observability is not evals and you need both. And if you are gating a retrieval system, the judge is only as good as the rest of your eval design, so it is worth avoiding the common RAG eval mistakes at the same time. For the broader picture of how we wire evals, gates, and monitoring together for SaaS teams, the AI engineering playbook lays out the full pipeline.

Frequently asked questions

Can I use the same model as both the feature and the judge?

You can, but watch for self-preference. A judge from the same family tends to favor outputs that sound like itself. If you see the judge passing answers that read well but are wrong, try a different model family for the judge, or lean on pairwise comparison and a reference answer to anchor it.

How many human labels do I actually need?

For a binary gate, 50 to 100 labeled examples is enough to get a usable agreement number and to expose the judge's biases. The point is not statistical perfection; it is catching a judge that is off by 20 points before it holds a gate. Grow the set over time as production surfaces new failure cases.
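To see how much a 50 to 100 example set can and cannot tell you, it helps to put an interval around the agreement rate. The Wilson score interval used below is one common choice and is an addition for illustration, not a method this article prescribes; the counts reuse the worked example (judge agreeing on 41 of 60, humans on 55 of 60).

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95 percent)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(41, 60)   # judge agreed with humans on 41 of 60
human_floor = 55 / 60                 # humans agreed with each other on 55 of 60
print(f"judge agreement is plausibly between {low:.2f} and {high:.2f}")
print(f"human floor: {human_floor:.2f}")
```

The interval at this sample size is wide, but its upper end still sits well below the human floor, which is why a set this small is enough to catch a judge that is off by 20 points even though it cannot separate 88 percent from 90.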

What agreement number is good enough?

There is no universal threshold. The right bar is your own human-human agreement on the same set. If your engineers agree 90 percent of the time and the judge agrees with the consensus around 88 to 90 percent, gate on it. A judge that lands far below your human floor is measuring the wrong thing.

Does this slow every release down?

The calibration is upfront work, measured once and then rechecked on a schedule. After that the judge runs in seconds per example in CI. The cost is small next to shipping a billing copilot that confidently quotes the wrong refund policy to paying customers.
