Silent failures: why your AI agent's worst bugs never throw

The most dangerous bug in a production AI agent is the one that never throws. When a tool returns wrong-but-well-formed data, or the model improvises around a broken response, no exception fires, no alert pages anyone, and the agent hands a confident wrong answer to a customer. The failure that matters is not the one that crashes. It is the one where the error signal never reaches a human in a form they can act on.

Silent failures are among the hardest AI agent incidents to catch in production, precisely because they look like success. A 200 OK from a tool call does not mean the output is correct. This post breaks down the three shapes silent failures take, why ordinary error handling misses all of them, and four engineering moves that make an agent fail loudly instead of confidently.

Why silent failures are an AI-specific problem

In ordinary software, an error propagates because the caller checks it. A function throws, the caller catches or crashes, and the failure is visible somewhere. That contract assumes a caller who inspects what it got back.

An LLM is a caller that never does this. Hand it a malformed tool result, a stale record, or an empty list, and it will not raise. It will generate plausible text that smooths over the gap. The model's core behavior, producing fluent output from whatever context it has, is exactly what turns a recoverable error into an invisible one. The worse the input, the more the model has to invent, and the more confident the invented answer tends to read.

That is why you cannot bolt standard try/catch onto an agent and call it handled. The exception layer protects your code. It does nothing about the model treating a bad value as fact.

The three shapes of a silent failure

The improvised recovery

A tool call fails or returns garbage, and instead of stopping, the model paraphrases around it and continues. A retrieval tool times out and returns an empty context, so the agent answers from its own weights and presents a guess as a grounded fact. A wrong argument at step two silently corrupts every step after it, because each step trusts the state the previous one wrote. Nothing in the run looks broken. The trace shows a clean sequence of tool calls, all green.

The well-formed lie

The tool returns a 200 with data that is schema-valid but semantically wrong: a stale read from a lagging replica, a default row, an empty result set that should have had one item. Your JSON schema validation passes because the shape is fine. The value is the problem, and shape checks never look at values.
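To make the gap concrete, here is a minimal sketch in Python. The field names and the hand-rolled shape check are assumptions for illustration, not the copilot's real schema; a JSON Schema validator would reach the same verdict for this purpose.

```python
# Hypothetical response shape for a subscription lookup; field names are assumptions.
REQUIRED_FIELDS = {"customer_id": str, "plan": str, "monthly_limit": int}


def shape_is_valid(record: dict) -> bool:
    """Return True when every required field is present with the expected type."""
    return all(
        isinstance(record.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


# What a lagging replica might still hold: the pre-upgrade plan.
stale = {"customer_id": "c_123", "plan": "starter", "monthly_limit": 1000}
# What the primary holds after the upgrade.
fresh = {"customer_id": "c_123", "plan": "growth", "monthly_limit": 10000}

# Both records pass. A shape check cannot tell them apart.
assert shape_is_valid(stale)
assert shape_is_valid(fresh)
```

The check only rejects records that are structurally broken, which is the one kind of bad record this failure shape never produces.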

The swallowed exception

This one is self-inflicted. Your own tool wrapper catches an error, logs it, and returns a default so the agent does not crash. The agent then treats the default as a real answer. You traded a loud crash for a quiet wrong answer, which is a strictly worse outcome in a system that acts on the result.
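A short Python sketch of the two wrappers side by side. The names fetch_usage, ToolError, and both wrapper functions are hypothetical, and the backend call is a stand-in that always times out.

```python
import logging

logger = logging.getLogger("tools")


class ToolError(Exception):
    """Raised when a tool cannot produce a trustworthy answer."""


def fetch_usage(customer_id: str) -> int:
    """Stand-in for a real backend call; it always fails in this sketch."""
    raise TimeoutError("usage service timed out")


def get_usage_swallowed(customer_id: str) -> int:
    # Anti-pattern: the error is logged, then replaced by a plausible default.
    try:
        return fetch_usage(customer_id)
    except Exception:
        logger.exception("usage lookup failed for %s", customer_id)
        return 0  # the agent will read this as "no usage this month"


def get_usage_loud(customer_id: str) -> int:
    # The failure stays a failure, so the caller has to handle it explicitly.
    try:
        return fetch_usage(customer_id)
    except TimeoutError as exc:
        raise ToolError(f"usage unavailable for {customer_id}") from exc
```

In the first wrapper the log line exists, but the value the agent sees is indistinguishable from a real zero. In the second, the agent layer receives a typed error it cannot mistake for data.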

A worked example: the billing copilot that quoted the wrong plan

Here is an illustrative case from the kind of B2B SaaS support copilot Boundev builds. The numbers are synthetic, but the failure mode is real.

The copilot has a get_subscription(customer_id) tool that reads plan and usage from a read replica. During a database failover, the replica lagged the primary by several minutes. A customer who had upgraded from Starter to Growth that morning asked about their overage charges. The tool returned a 200 with the customer's old Starter plan and old limits, because that is what the lagging replica still held.

The agent did exactly what it was designed to do. It read the plan, applied the Starter overage rate, and told a Growth customer they were about to be charged for exceeding a limit that no longer applied to them. No tool threw. The trace was all green. The output was fluent and specific and wrong. The team found out six days later when the customer escalated, and only then traced it back to the failover window.

The root cause was not the replica lag. Replicas lag. The root cause was that a stale-but-valid record was handed to a caller that had no way to know it was stale and every incentive to answer anyway.

Engineer the failure path, not just the happy path

The fix is to treat the failure path as a first-class part of the design, built with the same care as the happy path. Four moves do most of the work.

1. Validate tool outputs, not just inputs

Teams put a lot of care into validating what goes into a tool: typed arguments, schema checks, argument bounds. The return value usually gets none of that. Contract the output the same way. An empty result set where one row was expected, a null in a required field, a value outside a sane range, or a timestamp that is too old are all typed failures, not values the model gets to interpret. In the billing case, the output contract asserts that plan_updated_at is within a freshness window; a stale record fails the contract instead of flowing through as truth. This is the return-value half of the discipline covered in designing tools for production AI agents.
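An output contract for the billing case could look roughly like the sketch below. The error codes, the 30-second window, and the record fields other than plan_updated_at are assumptions. It also assumes plan_updated_at records when the row was last confirmed against the primary, which is one reading of that field.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed threshold; pick one that matches your tolerance for replication lag.
FRESHNESS_WINDOW = timedelta(seconds=30)


class ContractViolation(Exception):
    """A typed tool failure. `code` is what the agent layer branches on."""

    def __init__(self, code: str, detail: str):
        super().__init__(f"{code}: {detail}")
        self.code = code


def check_subscription(record: Optional[dict], now: datetime) -> dict:
    """Return the record unchanged if it satisfies the contract, else raise."""
    if record is None:
        raise ContractViolation("NOT_FOUND", "expected one row, got none")
    if not record.get("plan"):
        raise ContractViolation("MISSING_FIELD", "plan is null or empty")
    limit = record.get("monthly_limit")
    if not isinstance(limit, int) or limit <= 0:
        raise ContractViolation("OUT_OF_RANGE", f"monthly_limit={limit!r}")
    updated_at = record.get("plan_updated_at")
    if updated_at is None or now - updated_at > FRESHNESS_WINDOW:
        raise ContractViolation("STALE", f"plan_updated_at={updated_at!r}")
    return record
```

A caller would pass `datetime.now(timezone.utc)` as `now`; taking it as a parameter keeps the check deterministic in tests.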

2. Fail closed and force the model to surface uncertainty

When a tool cannot answer, the agent must not be allowed to fill the gap. Make "unknown" a first-class return that the model is required to propagate: a typed failure like STALE or NOT_FOUND that maps to a fixed behavior, such as "I could not confirm your current plan, let me pull it live" plus a fallback to the primary database. Failing closed means the default outcome of any uncertainty is to stop and say so, not to guess. This is the same instinct behind an explicit fallback when the model or a dependency fails, applied one layer down at the tool boundary.
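One way to sketch this in Python is a result type whose non-OK statuses map to fixed replies that the model never gets to rewrite. The Status enum, ToolResult, resolve_plan, and the NOT_FOUND wording are assumptions for illustration; the two read functions stand in for replica and primary lookups.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class Status(Enum):
    OK = "OK"
    STALE = "STALE"
    NOT_FOUND = "NOT_FOUND"


@dataclass(frozen=True)
class ToolResult:
    status: Status
    value: Optional[dict] = None


# Fixed replies: the model propagates these verbatim instead of improvising.
FIXED_REPLIES = {
    Status.STALE: "I could not confirm your current plan, let me pull it live.",
    Status.NOT_FOUND: "I could not find a subscription for this account.",
}


def resolve_plan(
    replica_read: Callable[[], ToolResult],
    primary_read: Callable[[], ToolResult],
) -> Tuple[Optional[dict], Optional[str]]:
    """Return (plan, None) when confirmed, or (None, fixed_reply) when not."""
    result = replica_read()
    if result.status is Status.STALE:
        result = primary_read()  # fall back to the primary database
    if result.status is Status.OK:
        return result.value, None
    return None, FIXED_REPLIES[result.status]
```

The design choice here is that there is no code path in which an unconfirmed plan reaches the model as a value: the caller gets either a confirmed record or a canned reply, never both.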

3. Make the error loud in the trace, not buried in logs

The failure signal has to reach a human in actionable form, which means the trace, not a log line nobody reads. Every tool call should record input, output, and a status, and assert invariants at the boundary: the row exists, the amount is positive, the result set is non-empty when it must be. Then alert on the class of failure that looks like success, the 200-but-empty and the contract-violation cases, because those are the ones that never page anyone otherwise. This is where silent-failure defense meets the observability-versus-evals gap: observability tells you an agent ran, invariant asserts tell you whether it ran correctly.
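A minimal Python sketch of a traced tool call with boundary invariants follows. ToolSpan, traced_call, the status strings, and the alert callback are all hypothetical; a real system would hand the span to its tracing backend and pager instead of a list and a function.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolSpan:
    tool: str
    tool_input: dict
    tool_output: Any
    status: str  # "ok", "error", or "invariant_failed"
    failed_invariants: List[str] = field(default_factory=list)


# The class that looks like success and would otherwise never page anyone.
ALERT_STATUSES = {"invariant_failed"}


def traced_call(
    tool: str,
    fn: Callable[..., Any],
    args: dict,
    invariants: Dict[str, Callable[[Any], bool]],
    trace: List[ToolSpan],
    alert: Callable[[ToolSpan], None],
) -> ToolSpan:
    """Run a tool, record a span, and alert when an invariant fails."""
    try:
        output = fn(**args)
    except Exception as exc:
        trace.append(ToolSpan(tool, args, repr(exc), "error"))
        raise
    failed = [name for name, check in invariants.items() if not check(output)]
    status = "invariant_failed" if failed else "ok"
    span = ToolSpan(tool, args, output, status, failed)
    trace.append(span)
    if status in ALERT_STATUSES:
        alert(span)
    return span
```

For example, a search tool that returns an empty list with no exception would be recorded as `invariant_failed` given `invariants={"non_empty": lambda rows: len(rows) > 0}`, and the alert fires even though nothing threw.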

4. Verify the outcome, not the tool call

For any tool that changes state, the fact you care about is the resulting state, not the call returning. "I issued the refund" is a claim the model will happily make the moment the tool returns 200. "The refund shows as settled when I read it back" is a fact. Read back the state you wrote, and guard the write with an idempotency key so that a retry cannot apply it twice. In the same billing system, a refund tool that returned success on a request that had actually been retried once was issuing occasional double refunds; a read-after-write check on the ledger caught the second refund before it settled. The handoff reliability problem in multi-agent systems is the same idea at the seam between agents: never trust that the previous step did what it reported.
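The read-back pattern can be sketched in Python with an in-memory ledger. The Ledger class and its methods are hypothetical stand-ins for a payments backend, and the sketch assumes the backend deduplicates writes by idempotency key.

```python
from typing import Dict, Optional


class Ledger:
    """In-memory stand-in for a payments ledger (hypothetical API)."""

    def __init__(self) -> None:
        self._refunds: Dict[str, dict] = {}  # idempotency key -> refund record

    def issue_refund(self, key: str, customer_id: str, amount_cents: int) -> int:
        # setdefault makes a retry with the same key a no-op.
        self._refunds.setdefault(
            key,
            {"customer_id": customer_id, "amount_cents": amount_cents, "state": "settled"},
        )
        return 200

    def get_refund(self, key: str) -> Optional[dict]:
        return self._refunds.get(key)

    def total_refunded(self, customer_id: str) -> int:
        return sum(
            r["amount_cents"]
            for r in self._refunds.values()
            if r["customer_id"] == customer_id
        )


def refund_and_verify(ledger: Ledger, key: str, customer_id: str, amount_cents: int) -> dict:
    """Issue a refund, then confirm the resulting state instead of trusting the 200."""
    ledger.issue_refund(key, customer_id, amount_cents)
    record = ledger.get_refund(key)
    if record is None:
        raise RuntimeError("refund not found on read-back")
    if record["amount_cents"] != amount_cents or record["state"] != "settled":
        raise RuntimeError(f"refund state mismatch: {record!r}")
    return record
```

Calling `refund_and_verify` twice with the same key leaves a single refund on the ledger, and the agent only reports success after the read-back confirms the amount and state.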

When you do not need all of this

Failure-path engineering has a real cost, and not every agent earns it. A read-only summarizer that drafts a reply for a human to approve can get by with loud logging and a person in the loop, because the human is the invariant check. The full treatment, output contracts plus fail-closed behavior plus read-after-write verification, pays for itself when the agent takes irreversible actions, feeds its output into downstream automated steps, or runs without a human reviewing each result. The more autonomous and side-effecting the agent, the more a silent wrong answer costs, and the more the failure path is worth building. If your agent executes code or hits internal systems, pair this with blast-radius containment so a wrong action is also a contained one.

Where to start this week

You do not need to re-architect the agent to make progress. Start by listing every tool the agent can call and marking which ones can return a well-formed wrong answer: a stale read, an empty set, a default row. Those are your silent-failure surface. Add an output contract and an invariant assert to the highest-traffic one, wire an alert on the contract-violation class, and put a read-after-write check on anything that moves money or changes account state. That is a few days of work, and it converts your most expensive class of production bug from invisible to loud. For a broader map of what to harden before and after launch, our AI engineering playbook covers the surrounding practices, and the LLM API resilience patterns post handles the failures that do throw.

Frequently asked questions

What is a silent failure in an AI agent?

A silent failure is any error whose signal never reaches a human in actionable form. The tool call returns, no exception fires, and the agent produces a fluent, confident, wrong result from bad or stale data. Because nothing crashes and the trace looks green, these failures ship to users and are usually found only when a customer complains.

Why doesn't standard error handling catch these?

Standard error handling assumes a caller that inspects what it got back and propagates failures. An LLM never does this; it generates plausible output from whatever context it has, including a broken tool response. Try/catch protects your code from crashing, but it does nothing about the model treating a stale or empty value as fact.

How do I detect silent failures if nothing throws?

Assert invariants at the tool boundary and alert on the failure class that looks like success. Contract tool outputs the way you contract inputs, check that results are fresh, non-empty, and in range, and for state-changing tools read back the resulting state instead of trusting the call. Then route those contract violations and 200-but-empty cases to an alert, not just a log.

Is this the same as observability for agents?

No. Observability tells you an agent ran and what steps it took. Silent-failure defense tells you whether those steps were correct. You need both: tracing to see the sequence of calls, and invariant asserts plus output contracts to catch the calls that returned successfully but wrong.
