AI agent tools: why tool calling fails in production

When a production AI agent misbehaves, the model is rarely the problem. The failure is almost always in the tool layer: a function that accepts ambiguous input, returns an unstructured error, runs a side effect twice on a retry, or exposes an action the agent should never have been able to take. The fix is to design every tool an agent can call like an API product, with strict contracts, idempotency, structured errors, and an orchestration layer that decides what the agent is allowed to do.

The model picks which tool to call and what arguments to pass. Everything after that point is ordinary software you control. That is good news, because it means the reliability of your agent is an engineering problem with known answers, not a property you have to pray the next model release will fix.

The tool layer is the failure surface, not the model

Teams spend weeks tuning prompts and swapping models when their agent flakes in production. Then they add tracing and find the real pattern: the model called the right tool, but the tool did something the model could not recover from. A search function returned a raw stack trace instead of a usable error. An order tool charged a customer twice because the agent retried a call that had actually succeeded. A reporting tool returned 4,000 rows and blew the context window before the agent could summarize anything.

None of those are reasoning failures. They are missing contracts, missing idempotency, and missing guardrails - the same defects you would catch in a code review of any internal API. The difference is that an LLM is a far less forgiving caller than a human developer. It will not read your docs, it will not notice that a date is in the wrong format until the tool rejects it, and it will happily loop on a tool that keeps failing in the same way. Your tools have to teach the agent how to use them, in band, on every call.

Design every tool like an API product

The most reliable agents treat each tool with the same discipline you would give a public API: a narrow purpose, a typed contract, and feedback the caller can act on. Three rules carry most of the weight.

Narrow, single-purpose tools beat one god-tool

A single manage_account tool that takes an action string and a free-form payload looks economical, but it pushes all the branching into the model, where you cannot test it. Tool-selection accuracy tends to drop as the surface area of each tool grows. Split it: get_account, update_billing_email, cancel_subscription. Each has a tight signature the model can fill correctly on the first try, and each maps to one auditable code path. When you have dozens of tools, group them and select hierarchically (pick a group first, then a tool within it) rather than dumping all of them into one prompt.
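As a rough illustration of the split, here is a minimal sketch of a registry holding three single-purpose tools. The Tool dataclass, the TOOLS registry, and the in-memory ACCOUNTS store are assumptions made for the example, not part of any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-memory account store, standing in for a real database.
ACCOUNTS = {"acct_1": {"billing_email": "old@example.com", "subscription": "active"}}


@dataclass(frozen=True)
class Tool:
    name: str
    description: str              # the text the model sees when choosing a tool
    parameters: dict              # argument name -> short type description
    handler: Callable[..., dict]  # exactly one code path per tool


def get_account(account_id: str) -> dict:
    return {"account_id": account_id, **ACCOUNTS[account_id]}


def update_billing_email(account_id: str, email: str) -> dict:
    ACCOUNTS[account_id]["billing_email"] = email
    return {"account_id": account_id, "billing_email": email}


def cancel_subscription(account_id: str) -> dict:
    ACCOUNTS[account_id]["subscription"] = "cancelled"
    return {"account_id": account_id, "subscription": "cancelled"}


# Three narrow tools instead of one manage_account(action, payload).
TOOLS = {
    tool.name: tool
    for tool in (
        Tool("get_account", "Read one account.",
             {"account_id": "string"}, get_account),
        Tool("update_billing_email", "Change the billing email on one account.",
             {"account_id": "string", "email": "string"}, update_billing_email),
        Tool("cancel_subscription", "Cancel the subscription on one account.",
             {"account_id": "string"}, cancel_subscription),
    )
}
```

Each entry can be tested and audited on its own, which is the property the god-tool gives up.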

Validate the contract before the tool runs

Define every argument with a schema and reject calls that do not match before any work happens. In a TypeScript service that means a Zod schema on the boundary; in Python, a Pydantic model. The validation layer is not optional politeness - it is how you stop a malformed call from reaching your database. Bad input should fail fast at the edge with a message the model can use, not halfway through a transaction.
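Here is a minimal sketch of validation at the edge, written with only the Python standard library so it runs without dependencies. In a real service a Pydantic model would usually play this role. The run_report tool, its two fields, and the ToolInputError class are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import date


class ToolInputError(ValueError):
    """Raised when a tool call does not match the tool's contract."""


@dataclass(frozen=True)
class ReportArgs:
    account_id: str
    start_date: date


def parse_report_args(raw: dict) -> ReportArgs:
    """Validate raw model-supplied arguments; raise before any work happens."""
    expected = {"account_id", "start_date"}
    unknown = set(raw) - expected
    if unknown:
        raise ToolInputError(f"unknown fields: {sorted(unknown)}")
    missing = expected - set(raw)
    if missing:
        raise ToolInputError(f"missing fields: {sorted(missing)}")

    account_id = raw["account_id"]
    if not isinstance(account_id, str) or not account_id:
        raise ToolInputError("field 'account_id' must be a non-empty string")

    start_raw = raw["start_date"]
    if not isinstance(start_raw, str):
        raise ToolInputError("field 'start_date' must be a string in ISO format YYYY-MM-DD")
    try:
        start = date.fromisoformat(start_raw)
    except ValueError:
        raise ToolInputError(
            f"field 'start_date' got {start_raw!r}; expected ISO format YYYY-MM-DD"
        ) from None
    return ReportArgs(account_id=account_id, start_date=start)


def run_report(raw_args: dict) -> dict:
    args = parse_report_args(raw_args)  # fails fast, before touching any data
    # Real work (queries, writes) would start only after this point.
    return {"account_id": args.account_id,
            "start_date": args.start_date.isoformat(),
            "rows": []}
```

Unknown fields are rejected rather than ignored, so a model that invents an argument finds out immediately.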

Return structured errors the model can recover from

A generic "500 error" tells the agent nothing, so it retries the same broken call. A structured error tells it exactly what to fix. Compare "request failed" with "field 'start_date' got '06/30/2026'; expected ISO format YYYY-MM-DD; retry with '2026-06-30'." The second one turns a dead end into a one-step correction. Treat error messages as a first-class part of the tool's interface, written for a machine reader that will act on them literally.
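One way to shape such an error is a small JSON-serializable payload. The sketch below is an assumption about what that payload could look like; the field names (code, field, got, expected, retryable, suggestion) are illustrative, not a standard. It offers a concrete correction only when the input is unambiguous.

```python
from datetime import datetime


def date_format_error(field: str, got: str) -> dict:
    """Build a machine-readable error for a date that is not YYYY-MM-DD."""
    error = {
        "ok": False,
        "error": {
            "code": "invalid_format",
            "field": field,
            "got": got,
            "expected": "ISO format YYYY-MM-DD",
            "retryable": True,
        },
    }
    # Assumption: callers sometimes send US-style MM/DD/YYYY. Suggest a fix
    # only when the day is above 12, so the value cannot also be DD/MM/YYYY.
    try:
        parsed = datetime.strptime(got, "%m/%d/%Y")
    except ValueError:
        return error
    if parsed.day > 12:
        error["error"]["suggestion"] = parsed.date().isoformat()
    return error


example = date_format_error("start_date", "06/30/2026")
```

With the input above, the payload carries a suggestion of 2026-06-30. An ambiguous value such as 06/07/2026 gets the expected format but no guess.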

Make side effects safe with idempotency keys

Retries are not an edge case for agents - they are the normal control flow. A timeout, a transient 503, or a re-plan all cause the agent to call a tool again. If that tool has a side effect, a naive retry double-charges a card, sends a duplicate email, or creates two refunds.

The pattern is the same one payment APIs have used for years. The orchestration layer generates an idempotency key for each logical operation and passes it as a tool argument. The tool records the key with the result on first execution; a repeated call with the same key returns the stored result instead of running the operation again. Generate the key once per logical operation in the orchestration layer, not inside the tool and not per attempt, so a retry carries the same key as the original attempt. With that one move, "the agent called refund twice" stops being an incident.
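A minimal sketch of the pattern, assuming an in-memory dictionary as the key store. The RefundTool class, its fields, and new_operation_key are hypothetical. A production version would need a durable store and an atomic check-and-write, which this example does not attempt.

```python
import uuid


class RefundTool:
    def __init__(self) -> None:
        self._seen = {}      # idempotency key -> (arguments, stored result)
        self.executions = 0  # how many times the side effect actually ran

    def issue_refund(self, idempotency_key: str, subscription_id: str,
                     amount_cents: int) -> dict:
        arguments = (subscription_id, amount_cents)
        if idempotency_key in self._seen:
            stored_arguments, stored_result = self._seen[idempotency_key]
            if stored_arguments != arguments:
                # Same key, different request: refuse instead of guessing.
                raise ValueError("idempotency key reused with different arguments")
            return stored_result  # a retry gets the original result back

        self.executions += 1      # the side effect happens only on this path
        result = {
            "refund_id": f"re_{self.executions}",
            "subscription_id": subscription_id,
            "amount_cents": amount_cents,
        }
        self._seen[idempotency_key] = (arguments, result)
        return result


def new_operation_key() -> str:
    """Called once per logical operation by the orchestration layer."""
    return uuid.uuid4().hex


tool = RefundTool()
key = new_operation_key()                      # generated outside the tool
first = tool.issue_refund(key, "sub_123", 4500)
retry = tool.issue_refund(key, "sub_123", 4500)  # e.g. after a timeout
```

Here the retry returns the stored result and the refund runs once, because both attempts carry the same key.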

Constrain what the agent is allowed to call

Do not let the model decide whether an action is permitted. That decision belongs in an explicit orchestration layer - a state machine, a workflow step, or a plain policy check - that sits between the model and your tools. The model proposes a tool call; your code decides whether the current state, the current user's permissions, and the request context allow it.

This is also the line where reliability and security meet. The same boundary that stops an agent from cancelling a subscription it was never asked to cancel is the one that contains a prompt-injection attempt, which is why we argue you should defend the execution layer rather than the model. Scope every tool to the acting user, gate destructive tools behind an explicit confirmation step, and keep an allow-list of which tools are reachable in which state. Our AI agent security checklist walks through the permission and logging items in order.
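The three checks named above (state allow-list, scoping to the acting user, confirmation for destructive tools) can be sketched as a plain policy function. The state names, tool names, and the Decision type are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical allow-list: which tools are reachable in which workflow state.
ALLOWED_TOOLS = {
    "triage": {"get_account", "list_subscriptions"},
    "resolving": {"get_account", "list_subscriptions",
                  "update_billing_email", "cancel_subscription"},
}
DESTRUCTIVE_TOOLS = {"cancel_subscription"}


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def authorize(state: str, tool_name: str, args: dict,
              acting_account_id: str, confirmed: bool) -> Decision:
    """Decide in code, not in the model, whether a proposed call may run."""
    if tool_name not in ALLOWED_TOOLS.get(state, set()):
        return Decision(False, f"tool '{tool_name}' is not reachable in state '{state}'")
    if args.get("account_id") != acting_account_id:
        return Decision(False, "call targets an account other than the acting user's")
    if tool_name in DESTRUCTIVE_TOOLS and not confirmed:
        return Decision(False, "destructive tool requires explicit confirmation")
    return Decision(True, "ok")
```

The model's proposal passes through authorize before any handler runs. A prompt-injected request for cancel_subscription during triage is refused by the first check, whatever the model was persuaded to ask for.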

A worked example: a B2B billing agent

Take a support agent for a SaaS billing product. A customer asks, "cancel my extra seats and refund the difference for this month." Numbers below are illustrative, but the shape is real.

The first version shipped with two broad tools, modify_subscription and issue_refund, each taking a free-form JSON blob. In testing it looked fine. In production it cancelled the wrong plan tier when the customer had two subscriptions, because the model guessed at an ambiguous field, and on one retry it issued the refund twice when the first call timed out after the refund had gone through. Two incidents in the first week, both from the tool layer.

The fix touched no prompt. The team split the tools into list_subscriptions, remove_seats, and issue_refund, each with a strict schema. issue_refund took an idempotency key and a specific subscription_id rather than a description. An orchestration step required a list_subscriptions call first when the account had more than one plan, so the agent resolved the ambiguity from data instead of guessing. A policy check capped any single refund and routed anything above the cap to a human. Post-change, the duplicate-refund class of bug went to zero, and the wrong-plan errors disappeared because the ambiguous field no longer existed.
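The two orchestration rules in that fix are small enough to sketch directly. The function names and the cap value below are illustrative assumptions, not the team's actual code or numbers.

```python
REFUND_CAP_CENTS = 10_000  # illustrative cap; a real value is a business decision

# Tools that act on one specific subscription.
SUBSCRIPTION_SCOPED_TOOLS = {"remove_seats", "issue_refund"}


def next_required_step(subscription_count: int, listed_already: bool,
                       requested_tool: str) -> str:
    """Force a list_subscriptions call before acting on a multi-plan account."""
    if (subscription_count > 1 and not listed_already
            and requested_tool in SUBSCRIPTION_SCOPED_TOOLS):
        return "list_subscriptions"
    return requested_tool


def route_refund(amount_cents: int) -> str:
    """Refunds at or under the cap run automatically; larger ones go to a human."""
    return "auto" if amount_cents <= REFUND_CAP_CENTS else "human_review"
```

Neither function involves the model. Both are ordinary branches that can be unit tested.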

That is the whole thesis in one example: the model was competent the entire time. The reliability came from the tools around it. The same iterate-then-stop discipline applies to retrieval, which we cover in treating agentic RAG as a control loop with hard stop conditions.

Instrument tool calls, then gate on them

You cannot fix a tool layer you cannot see. Emit a trace span for every tool call with the arguments, the result or error class, the latency, and the retry count. OpenTelemetry-style spans make the agent's actual behavior - not your mental model of it - the thing you debug. Most of the "the agent is broken" reports we triage turn out to be one tool returning the wrong shape under a specific input, visible in seconds once the spans exist.
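To show the shape of such a span without pulling in a tracing library, here is a standard-library sketch that records each call into a list. In a real system an OpenTelemetry exporter would take the place of the SPANS list. The traced_tool decorator and the search_orders tool are hypothetical.

```python
import functools
import time

SPANS = []  # stand-in for a real tracing backend


def traced_tool(func):
    """Record arguments, outcome, latency, and retry count for every call."""
    @functools.wraps(func)
    def wrapper(*, retry_count: int = 0, **arguments):
        span = {"tool": func.__name__, "arguments": arguments,
                "retry_count": retry_count}
        start = time.perf_counter()
        try:
            result = func(**arguments)
            span["outcome"] = "ok"
            span["result"] = result
            return result
        except Exception as exc:
            span["outcome"] = type(exc).__name__  # the error class, not the trace
            raise
        finally:
            span["latency_ms"] = (time.perf_counter() - start) * 1000
            SPANS.append(span)
    return wrapper


@traced_tool
def search_orders(customer_id: str) -> list:
    if not customer_id:
        raise ValueError("customer_id is required")
    return [{"order_id": "ord_1", "customer_id": customer_id}]


search_orders(customer_id="cust_1")
```

The orchestration layer passes retry_count on each attempt, so a tool that only fails on its second call shows up as such in the spans.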

Tracing tells you what happened on a single run; it does not tell you whether a change made the agent better. For that you need offline evaluation on a fixed set of tasks, which is the distinction we draw in observability versus evals for AI agents. Build a small suite of representative tool-use scenarios, assert on the calls the agent makes and the final state it reaches, and run it in CI so a model swap or a tool change cannot regress behavior silently. If you are exposing tools over the Model Context Protocol, the same contract and validation rules apply at the server boundary; our guide to building an MCP server shows where they sit.
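A scenario test of that kind might look like the sketch below. The scripted_agent function is a hypothetical stand-in for a real model-driven agent, and the tools are in-memory fakes. In CI you would swap in the real agent and keep the assertions.

```python
def run_scenario(agent, task: str, tools: dict) -> list:
    """Run one task and return the ordered list of (tool name, arguments) calls."""
    calls = []

    def call_tool(name: str, **args):
        calls.append((name, args))
        return tools[name](**args)

    agent(task, call_tool)
    return calls


def scripted_agent(task: str, call_tool) -> None:
    # Hypothetical stand-in for the real agent: lists plans, then removes seats.
    subscriptions = call_tool("list_subscriptions", account_id="acct_1")
    call_tool("remove_seats", subscription_id=subscriptions[0]["id"], seats=2)


def test_remove_seats_scenario() -> None:
    state = {"sub_1": {"seats": 5}}  # fake backend state for this scenario

    def list_subscriptions(account_id: str) -> list:
        return [{"id": sub_id, **sub} for sub_id, sub in state.items()]

    def remove_seats(subscription_id: str, seats: int) -> dict:
        state[subscription_id]["seats"] -= seats
        return {"id": subscription_id, **state[subscription_id]}

    tools = {"list_subscriptions": list_subscriptions, "remove_seats": remove_seats}
    calls = run_scenario(scripted_agent, "remove 2 extra seats", tools)

    # Assert on the calls the agent made...
    assert [name for name, _ in calls] == ["list_subscriptions", "remove_seats"]
    assert calls[1][1] == {"subscription_id": "sub_1", "seats": 2}
    # ...and on the final state it reached.
    assert state["sub_1"]["seats"] == 3
```

Because the assertions target tool calls and end state rather than the model's wording, the same test stays meaningful across a model swap.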

Where to start this week

You do not need to rebuild the agent. Pick the one tool with a side effect that worries you most and give it three things: a strict input schema, a structured error format, and an idempotency key. Then add a trace span so you can see it run. That single tool, hardened, usually removes the loudest class of production incident, and it gives you a template for the rest. For the broader sequence - contracts, evals, observability, and rollout - our AI engineering playbook lays out the order we follow on client work.

FAQ

Is tool calling the same as function calling?

They refer to the same mechanism. "Function calling" was the original term for a model returning a structured request to run a named function with typed arguments; "tool calling" and "tool use" are the broader names now common across providers. The engineering concerns - contracts, validation, idempotency, permissions - are identical whatever you call it.

Should I expose my whole API to the agent as tools?

No. Expose narrow, task-shaped tools, not a one-to-one mapping of every endpoint. A god-tool that forwards arbitrary requests pushes branching into the model where you cannot test it, and it widens the blast radius of a bad call. Curate a small set of single-purpose tools scoped to what the agent actually needs to do.

How do I stop an agent from looping on a failing tool?

Two defenses together. Return structured errors so a fixable failure becomes a one-step correction instead of a repeated dead end, and cap retries in the orchestration layer so an unrecoverable error stops the loop and escalates to a human rather than burning tokens.
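Both defenses fit in one short loop. This is a sketch under assumptions: tools return a dictionary with an ok flag and, on failure, an error carrying a retryable field, and revise_args is a hypothetical stand-in for the model producing corrected arguments.

```python
def call_with_retry_cap(tool, args: dict, revise_args, max_attempts: int = 3) -> dict:
    """Retry fixable failures up to a cap, then hand off to a human."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        result = tool(**args)
        if result["ok"]:
            return {"status": "done", "attempts": attempt, "result": result}
        last_error = result["error"]
        if not last_error.get("retryable", False):
            # Unrecoverable: stop immediately instead of burning tokens.
            return {"status": "escalate_to_human", "attempts": attempt,
                    "error": last_error}
        # In a real agent the structured error goes back to the model here.
        args = revise_args(args, last_error)
    return {"status": "escalate_to_human", "attempts": max_attempts,
            "error": last_error}


def always_failing_tool(**_args) -> dict:
    return {"ok": False, "error": {"code": "upstream_unavailable", "retryable": True}}


outcome = call_with_retry_cap(always_failing_tool, {"order_id": "ord_1"},
                              lambda args, error: args)
```

With a tool that never recovers, the loop stops after three attempts and returns an escalation instead of calling the tool again.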

Does the Model Context Protocol remove the need for tool design?

No. MCP standardizes how tools are discovered and called across clients, which is useful, but it does not validate your inputs, make your side effects idempotent, or decide what an agent is permitted to do. Those remain your responsibility on the server side, exactly as they would for any API.
