
Observability isn't evals: why your AI agent fails silently

Observability tells you an AI agent is running. Evals tell you it is right. Those are different questions, and most teams only answer the first one. You can watch every trace, log every token, and chart p95 latency to the millisecond while the agent quietly returns wrong answers to real users. The fix is not more dashboards. It is a second layer, graded against expected outputs, that fails the build when quality drops.

This is the gap behind a lot of production agent failures in 2026. Teams instrument heavily, then get surprised by a refund-flow agent that approved the wrong cases for three weeks. The traces were green the whole time.

Observability and evals answer different questions

Observability is the operational view. It captures what happened: the prompt, the retrieved context, the tool calls, the tokens, the latency, the cost, the error rate. It is the same discipline you already apply to any service, extended to cover LLM-specific fields. It is necessary, and the industry has adopted it fast.

Evals are the correctness view. An eval takes an input, runs the agent, and grades the output against what a good answer looks like. Did the support agent cite the right policy? Did the extraction agent pull the correct invoice total? Did the router send the ticket to the right queue? Observability cannot answer those questions, because a wrong answer and a right answer look identical on every operational signal. Same latency, same token count, same green dashboard.

The two layers stack. Observability tells you the agent is up and how it behaves under load. Evals tell you whether the behavior is any good. A team with strong observability and no evals has a detailed recording of its own failures and no way to notice them.

Why your dashboards stay green while the agent is wrong

An LLM does not throw an exception when it is wrong. It returns a fluent, confident, well-formed answer that happens to be incorrect. None of your standard signals move:

  • Latency looks normal, because a wrong answer takes the same time to generate as a right one.
  • Token usage looks normal, because length is unrelated to correctness.
  • HTTP status is 200, because the call succeeded. The model did its job; it just did it wrong.
  • Error rate stays flat, because nothing errored.

This is what makes agent failures silent. A traditional service fails loudly: a 500, a timeout, a stack trace. An agent fails politely. The only thing that catches it is a graded comparison against an expected output, which is exactly what observability does not do. We covered the post-launch version of this in what happens to AI quality after launch: the model meets messy real inputs and drifts, and without a correctness signal nobody finds out until a customer does.

A worked example: the refund agent that looked healthy

Take a SaaS billing product that ships an agent to triage refund requests. It reads the ticket, checks the account, and recommends approve, deny, or escalate. The team instruments it well: full traces, latency charts, cost per ticket, a token budget alert. For three weeks the dashboards are clean. Median latency 1.9s, cost about 4 cents per ticket, zero errors.

Then finance notices refunds are up 18 percent. The agent had been approving requests that fell just outside the policy window, because a prompt change three weeks earlier softened how it read the eligibility rule. Every one of those approvals was a clean, fast, 200-status, fully-traced success. Observability recorded the damage in perfect detail and flagged none of it.

An eval suite would have caught it before the prompt shipped. Fifty labeled refund cases, ten of them just outside the window, graded on whether the agent's recommendation matched the correct decision. The softened prompt drops from 100 percent to 80 percent on that set, the eval gate fails, and the change never reaches production. The cost of building that suite was a day. The cost of not having it was three weeks of wrong refunds and a finance escalation.
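To make the arithmetic concrete, here is a minimal Python sketch of that gate. The 30-day window, the five-day grace period, the 0.95 threshold, and both agent functions are stand-ins invented for illustration; the real agent is an LLM call, not an if-statement.

```python
from dataclasses import dataclass

POLICY_WINDOW_DAYS = 30   # assumed policy window, for illustration only
THRESHOLD = 0.95          # assumed pass mark for the gate


@dataclass(frozen=True)
class RefundCase:
    days_since_purchase: int
    expected: str  # human-labeled correct decision


def strict_agent(case: RefundCase) -> str:
    """Stand-in for the agent before the prompt change."""
    return "approve" if case.days_since_purchase <= POLICY_WINDOW_DAYS else "deny"


def softened_agent(case: RefundCase) -> str:
    """Stand-in for the agent after the change: it tolerates a few extra days."""
    return "approve" if case.days_since_purchase <= POLICY_WINDOW_DAYS + 5 else "deny"


def accuracy(agent, cases) -> float:
    """Fraction of cases where the recommendation matches the label."""
    return sum(agent(c) == c.expected for c in cases) / len(cases)


# Fifty labeled cases: 20 clearly inside the window, 20 clearly outside,
# and 10 just outside it (the ones the softened prompt gets wrong).
cases = (
    [RefundCase(10, "approve")] * 20
    + [RefundCase(90, "deny")] * 20
    + [RefundCase(32, "deny")] * 10
)

for name, agent in [("strict", strict_agent), ("softened", softened_agent)]:
    score = accuracy(agent, cases)
    verdict = "pass" if score >= THRESHOLD else "FAIL"
    print(f"{name}: {score:.0%} -> {verdict}")
```

The strict stand-in scores 100 percent and passes. The softened one misses exactly the ten borderline cases, lands on 80 percent, and fails the gate.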

How to add evals on top of observability

You do not need a platform or a new vendor to start. You need a graded test set and a gate. Here is the order that works.

1. Mine your traces for a labeled set

Your observability data is the raw material for evals. Pull 50 to 100 real production interactions, including the weird ones, and label the correct output for each. Real traffic beats synthetic cases because it carries the actual distribution of messy inputs your agent meets. This is the one place the two layers connect directly: observability supplies the data, evals supply the grade.
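One way to start is to turn exported traces into a labeling queue. The sketch below assumes each trace is a dict with trace_id, input, and output keys; those field names are hypothetical, so map them to whatever your tracing tool actually exports.

```python
import json
import random


def build_labeling_queue(traces, n=50, seed=0):
    """Sample up to n traces and keep only what a human labeler needs."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked = rng.sample(traces, min(n, len(traces)))
    return [
        {
            "trace_id": t["trace_id"],
            "input": t["input"],
            "agent_output": t["output"],
            "expected": None,  # filled in by a human reviewer
        }
        for t in picked
    ]


def to_jsonl(rows) -> str:
    """Serialize rows as JSON Lines, one case per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows)
```

A uniform random sample will under-represent rare inputs, so in practice you would likely top it up by hand with the weird cases you already know about.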

2. Write the acceptance criteria first

Before you score anything, decide what a good answer is. For the refund agent it is decision-match against the policy. For a support agent it might be did-it-cite-the-right-doc plus did-not-hallucinate-a-feature. Vague targets are a common reason agents get pulled from production, so write these down as concrete, checkable rules. We go deeper on this in writing acceptance criteria for an AI feature.
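Written as code, the support-agent criteria might look like the sketch below. It assumes the agent's answer has already been parsed into a dict with cited_doc_ids and features_mentioned fields; that shape, and the feature list, are invented for illustration.

```python
# Illustrative list of features the product really has (an assumption).
KNOWN_FEATURES = {"sso", "audit log", "csv export"}


def cites_right_doc(answer: dict, expected_doc_id: str) -> bool:
    """Criterion 1: the expected policy doc is among the citations."""
    return expected_doc_id in answer.get("cited_doc_ids", [])


def no_hallucinated_feature(answer: dict) -> bool:
    """Criterion 2: every feature the answer mentions actually exists."""
    mentioned = {f.lower() for f in answer.get("features_mentioned", [])}
    return mentioned <= KNOWN_FEATURES


def check_criteria(answer: dict, expected_doc_id: str) -> dict:
    """Return a pass/fail result per named criterion."""
    return {
        "cites_right_doc": cites_right_doc(answer, expected_doc_id),
        "no_hallucinated_feature": no_hallucinated_feature(answer),
    }
```

Reporting each criterion by name, rather than one combined score, makes it easier to see which rule a change broke.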

3. Pick a grader that fits the task

Not every eval needs an LLM judge. Use exact match or a regex when the output is structured (a category, an ID, a number). Use a model-graded check only for open-ended text, and validate that grader against a handful of human labels so you trust its scores. The cheapest reliable grader that fits the output is the right one. Most teams over-reach for LLM-as-judge when a string comparison would do.
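For structured outputs, the graders can be a few lines each. This is a rough sketch using only the Python standard library; the function names are made up for this example, and the amount pattern is deliberately simple (it takes the first number it finds).

```python
import re
from decimal import Decimal

# First run of digits, with optional thousands commas and a decimal part.
AMOUNT_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def exact_match(output: str, expected: str) -> bool:
    """Grade a category or ID, ignoring case and surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def amount_match(output: str, expected: str) -> bool:
    """Grade a number: pull the first amount out of the text and compare."""
    found = AMOUNT_RE.search(output)
    if found is None:
        return False
    return Decimal(found.group().replace(",", "")) == Decimal(expected)


def agreement_rate(grader_verdicts, human_verdicts) -> float:
    """Share of cases where a model grader agrees with the human label."""
    pairs = list(zip(grader_verdicts, human_verdicts))
    return sum(g == h for g, h in pairs) / len(pairs)
```

If you do use a model-graded check, agreement_rate is the kind of number to look at before trusting it: run the grader over the cases humans have already labeled and see how often the two agree.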

4. Gate the build on the score

An eval suite that runs once and lives in a notebook decays within a month. Wire it into CI so every prompt change, model swap, or tool edit runs the set and fails the build below a threshold you set. This is the same discipline as shadow mode and canary rollouts, applied to correctness instead of stability. The gate is what turns a test set into a guardrail.
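The gate itself can be small. The sketch below assumes the agent can be called as a plain function from input text to output text and that cases are (input, expected) pairs; both are simplifications for illustration.

```python
from typing import Callable, Sequence, Tuple

Case = Tuple[str, str]  # (input, expected output)


def run_suite(agent: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Run every case through the agent and return the pass rate."""
    passed = sum(agent(inp) == expected for inp, expected in cases)
    return passed / len(cases)


def gate(score: float, threshold: float) -> int:
    """Return a process exit code: 0 passes the build, 1 fails it."""
    if score >= threshold:
        print(f"eval gate passed: {score:.1%} >= {threshold:.1%}")
        return 0
    print(f"eval gate FAILED: {score:.1%} < {threshold:.1%}")
    return 1
```

In a CI script you would end with something like sys.exit(gate(run_suite(agent, cases), 0.95)). CI systems generally treat a non-zero exit code as a failed step, which is what blocks the merge.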

5. Watch for eval drift, not just model drift

Your eval set is a snapshot of yesterday's traffic. As real inputs shift, a suite that scored 95 percent can quietly stop representing what users actually send. Refresh it from new traces on a schedule, and add a case every time production surfaces a failure mode the set missed. The eval set is a living artifact, not a fixed exam.
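A lightweight way to keep the set alive is to append each production failure as a new case and track how old the set is getting. In this sketch the case format, the added date field, and the 90-day cutoff are all assumptions, not a standard.

```python
from datetime import date


def _key(text: str) -> str:
    """Normalize case and whitespace so near-identical inputs dedupe."""
    return " ".join(text.lower().split())


def add_failure_case(eval_set: list, input_text: str, expected: str,
                     found_on: date) -> bool:
    """Append a new case unless the same input is already covered."""
    existing = {_key(c["input"]) for c in eval_set}
    if _key(input_text) in existing:
        return False
    eval_set.append({
        "input": input_text,
        "expected": expected,
        "added": found_on.isoformat(),
    })
    return True


def stale_fraction(eval_set: list, today: date, max_age_days: int = 90) -> float:
    """Share of cases older than max_age_days, as a rough refresh signal."""
    old = sum(
        (today - date.fromisoformat(c["added"])).days > max_age_days
        for c in eval_set
    )
    return old / len(eval_set)
```

A rising stale fraction does not prove the set is wrong, only that it is due for a look against recent traces.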

A 30-day plan to close the gap

If you already have observability and no evals, you can close the gap in a month without pausing delivery.

  • Week 1: export 50 real interactions from your traces and label the correct output for each.
  • Week 2: write acceptance criteria, pick a grader per task, and get a baseline score on the current agent.
  • Week 3: wire the suite into CI as a gate that fails the build below your threshold.
  • Week 4: add the failure modes you find back into the set, and set a monthly refresh from new traces.

The point is not to grade everything. It is to grade the decisions that cost money when they are wrong, and to make a bad score block a release. Strong context engineering for production agents reduces how often the agent is wrong; evals are how you prove it stayed that way after the next change.

Frequently asked questions

Is observability still worth it if I have evals?

Yes. They cover different failures. Observability catches operational problems: a slow tool, a cost spike, a retrieval timeout, a rate limit. Evals catch correctness problems: a confident wrong answer that returns a clean 200. You want both. Evals tell you what to fix; observability traces tell you why it broke.

How many eval cases do I need to start?

Start with 50 real, labeled interactions and grow from there. A small set that runs on every change and gates the build beats a large set that runs once a quarter. Add cases as production surfaces new failure modes rather than trying to enumerate everything up front.

Do I need an LLM-as-judge for evals?

Often no. If the output is a category, an ID, or a number, exact match or a regex is faster, cheaper, and more reliable. Reserve model-graded scoring for open-ended text, and check that grader against human labels before you trust it. Many teams reach for a judge model when a string comparison would be more honest. See the eval mistakes that sink a RAG system for the common traps.

Where do evals live, in CI or in production?

Both, with different jobs. In CI, an eval suite gates changes before they ship. In production, you sample live traffic and grade it offline to catch drift the static set misses. The CI gate prevents known regressions; the production sample finds new ones.
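For the production side, one simple approach is deterministic sampling by trace ID, so the same trace is always in or out of the graded sample no matter which worker sees it. This is a sketch under that assumption; the function names and the 5 percent rate are illustrative.

```python
import hashlib


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically decide whether a trace is selected for grading."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def select_for_grading(trace_ids, rate: float = 0.05):
    """Return the subset of trace IDs to send to the offline grader."""
    return [t for t in trace_ids if in_sample(t, rate)]
```

Because selection depends only on the ID, rerunning the job over the same traffic picks the same traces, which keeps offline scores comparable from one run to the next.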
