Keep your AI feature up when the LLM API fails

Every LLM provider has an outage eventually. Rate limits trip during a traffic spike, a region goes down, a model gets deprecated with two weeks' notice, or latency quietly creeps from 800ms to 6 seconds. If your SaaS feature calls one provider with a naive try/catch, every one of those events is a customer-facing incident. The feature that worked in the demo becomes the feature that takes down a workflow your customers depend on.

Resilience for AI features is not exotic. It is three well-understood patterns (retries, fallbacks, and circuit breakers) applied with discipline. The mistake most teams make is reaching for retries alone, implementing them badly, and skipping the other two. Here is the resilience layer we put around production LLM calls.

First, know which failure you are handling

The single biggest resilience mistake is treating every error the same. Before any retry logic, classify the failure.

Transient failures clear in seconds: a 429 rate limit, a brief network glitch, a momentary 503. These are retry territory. Persistent failures do not go away by waiting: a full provider outage, a model returning garbage for 20 minutes, a suspended account, a 400 because your request is malformed. Retrying these burns money and accomplishes nothing except hammering an endpoint that cannot help you.

The rule: retry transient failures, fall back or fail fast on persistent ones. A 429 with a Retry-After header is retryable. A 400 is a bug in your code and retrying it just repeats the bug.
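The rule can be sketched as a small classifier. This is a minimal illustration, not a complete taxonomy: the names FailureKind and classify_failure are hypothetical, and the set of status codes treated as transient is an assumption you should check against your provider's documentation.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"    # worth retrying after a short wait
    PERSISTENT = "persistent"  # fall back or fail fast


# Assumed mapping: rate limits and brief gateway/availability errors.
TRANSIENT_STATUSES = {429, 502, 503, 504}


def classify_failure(status: Optional[int], network_error: bool = False) -> FailureKind:
    """Classify a failed call from its HTTP status or a network-level error."""
    if network_error or status is None:
        # Connection resets and timeouts carry no status code.
        return FailureKind.TRANSIENT
    if status in TRANSIENT_STATUSES:
        return FailureKind.TRANSIENT
    # 400, 401, 403 and anything unrecognized: retrying repeats the failure.
    return FailureKind.PERSISTENT
```

A status code alone cannot tell a momentary 503 from a full outage. That distinction comes from watching failures over time, which is the circuit breaker's job.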

Retries, done correctly

Retries without backoff and jitter are how a minor rate limit becomes a self-inflicted outage. When a 429 hits, a naive loop fires again immediately, hammers the already-overloaded endpoint, exhausts its retry budget in milliseconds, and leaves the provider no window to recover. Three things make retries actually work.

Exponential backoff with jitter

Wait longer after each failure (1s, 2s, 4s) and add randomness so a thousand clients that all failed at the same instant do not all retry at the same instant. Jitter is what prevents the thundering-herd retry storm that takes a recovering provider back down.
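One common way to compute the wait is the variant often called full jitter: pick a random delay between zero and the exponential ceiling. The sketch below assumes a 1-second base and a 30-second cap; both values and the function name backoff_delay are illustrative.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Return a jittered delay in seconds for a zero-based retry attempt."""
    # Exponential ceiling: 1s, 2s, 4s, ... limited by the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: spread clients uniformly across the whole window.
    return random.uniform(0.0, ceiling)
```

Because every client draws its own random delay, clients that failed at the same instant spread their retries across the window instead of arriving together.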

Honor the provider's headers

When the API tells you Retry-After or sends x-ratelimit-remaining and x-ratelimit-reset, obey it. The provider knows when it will accept your next request better than your backoff curve does. Reading those headers turns a guess into a contract.
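As a sketch, here is one way to read Retry-After, which HTTP allows as either a number of seconds or an HTTP date. The function name retry_after_seconds is hypothetical. The x-ratelimit-* headers are left out because their formats differ between providers, so check the documentation for the one you use.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(
    headers: Mapping[str, str], now: Optional[datetime] = None
) -> Optional[float]:
    """Return the wait the server asked for, or None if absent or unparseable."""
    value = headers.get("Retry-After") or headers.get("retry-after")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "7"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative wait.
    return max(0.0, (when - now).total_seconds())
```

When this returns None, fall back to your own backoff curve.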

Budget your retries

Cap total retry attempts and total retry time. Two retries on a user-facing request, not ten. A retry budget means a single bad request cannot spend 30 seconds of a user's patience before giving up. For a streaming feature this matters even more, because the user is watching. That is why we treat the loading state as part of the feature, as covered in our post on streaming LLM responses as a SaaS feature.
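A retry budget can be sketched as a wrapper that enforces both caps. The names here are hypothetical: TransientError stands in for whatever your client raises on a retryable failure, and the defaults (two retries, five seconds) are illustrative.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error for retryable failures (429, brief 503, network blip)."""

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after  # server hint in seconds, if any


def call_with_retry_budget(
    fn: Callable[[], T],
    max_retries: int = 2,
    time_budget: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call fn, retrying transient failures within an attempt and time budget."""
    deadline = clock() + time_budget
    attempt = 0
    while True:
        try:
            return fn()
        except TransientError as err:
            if attempt >= max_retries:
                raise  # attempt budget spent
            # Prefer the server's hint; otherwise use jittered exponential backoff.
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = random.uniform(0.0, float(2 ** attempt))
            if clock() + delay > deadline:
                raise  # waiting would overrun the time budget
            sleep(delay)
            attempt += 1
```

Any other exception propagates on the first attempt, so persistent failures are never retried. The time budget only bounds the waiting between attempts, so each call still needs its own request timeout.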

Fallbacks: have somewhere to go

Retries help with transient blips. They do nothing for a provider that is simply down. For that you need a fallback chain: an ordered list of alternatives the system tries when the primary fails.

A real production chain mixes same-provider and cross-provider options so you survive both a single model problem and a whole-provider outage. A typical shape is primary model, then a cheaper model from the same provider, then a comparable model from a different provider, then a self-hosted model as the no-external-dependency floor. The cross-provider hop is the one that saves you during a real outage, because a same-provider fallback is useless when the whole provider is dark.

Two things make fallbacks safe rather than dangerous. First, your prompts and parsing must work across models, which is another reason to enforce structure at the schema level rather than per-model prompt hacks (see our post on getting reliable structured outputs). Second, you must log which model actually served each request, because a silent fallback to a weaker model can degrade output quality in ways your users notice before your dashboards do. Cost shifts too when you fail over, which ties into the tradeoffs covered in our post comparing Bedrock and OpenAI API cost in production.
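A fallback chain can be sketched as an ordered list of (name, callable) pairs, tried in order, with the serving model logged on every request. Everything named here is an assumption for illustration: ProviderError, AllProvidersFailed, complete_with_fallback, and the idea that each model is wrapped in a function taking a prompt and returning text.

```python
import logging
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger("llm.fallback")


class ProviderError(Exception):
    """Hypothetical error a model wrapper raises when its provider fails."""


class AllProvidersFailed(Exception):
    """Raised when every entry in the chain has failed."""


def complete_with_fallback(
    prompt: str, chain: Sequence[Tuple[str, Callable[[str], str]]]
) -> Tuple[str, str]:
    """Try each model in order; return (model_name, text) from the first success."""
    errors: List[str] = []
    for depth, (name, call) in enumerate(chain):
        try:
            text = call(prompt)
        except ProviderError as err:
            logger.warning("model=%s failed: %r", name, err)
            errors.append(f"{name}: {err!r}")
            continue
        # Record which model served the request so silent degradation is visible.
        logger.info("served_by=%s fallback_depth=%d", name, depth)
        return name, text
    raise AllProvidersFailed("; ".join(errors))
```

Only ProviderError triggers the next hop. An error caused by your own request, such as a 400, should surface immediately rather than be replayed against every model in the chain.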

Circuit breakers: stop digging

If a provider has failed your last five calls, the sixth call is almost certainly going to fail too. A circuit breaker tracks recent failures and, once a threshold is crossed, opens: it rejects calls immediately instead of waiting for each to time out. After a cooldown (say 60 seconds) it goes half-open and lets test requests through one at a time. Two consecutive successes and it closes again; a single failure and it reopens.

The point is to stop wasting time and money on a service that is clearly broken, and to fail fast so your fallback kicks in immediately instead of after a 30-second timeout per request. For LLM apps, the standard triggers (consecutive failures, error rate, p95 latency) are necessary but not sufficient. Add two LLM-specific ones: cost per request exceeding a ceiling, and runaway conversation turn counts in agentic flows. Breaking the circuit on cost has saved more than one team from a prompt-loop bug that would have run up a five-figure bill overnight. We build the same guardrails into long-running agents, as described in the agentic RAG control loop.
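The state machine is small enough to sketch. This minimal version uses only the consecutive-failure trigger, with the thresholds from the text (five failures, a 60-second cooldown, two successes to close) as defaults. The class and method names are hypothetical. It is not thread-safe, and it does not limit how many test requests run at once while half-open.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Consecutive-failure breaker with closed, open, and half_open states."""

    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown: float = 60.0,
        successes_to_close: int = 2,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.successes_to_close = successes_to_close
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        """Return False while open; move to half_open once the cooldown has passed."""
        if self.state == "open":
            if self._clock() - self._opened_at < self.cooldown:
                return False  # fail fast so the fallback runs immediately
            self.state = "half_open"
            self._successes = 0
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self._successes += 1
            if self._successes >= self.successes_to_close:
                self.state = "closed"
                self._failures = 0
        elif self.state == "closed":
            self._failures = 0  # only consecutive failures count

    def record_failure(self) -> None:
        if self.state == "open":
            return  # late result from a call made before the trip
        if self.state == "half_open":
            self._trip()  # a failed test request reopens the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
        self._failures = 0
```

In use, check allow_request() before each call. If it returns False, skip the provider and go straight to the fallback; otherwise make the call and report the outcome with record_success() or record_failure().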

Build it or buy a gateway

You can implement all of this in application code, and for a single feature that is often the right call: it is a few hundred lines and you control every decision. As you add more AI features, a gateway that fronts multiple providers behind one endpoint with built-in fallback routing, retries, and budgets removes the duplication. The build-versus-buy line is usually drawn at the second or third feature; before that, hand-rolled is simpler. After that, the gateway pays for itself in one avoided incident.

Whichever you choose, the resilience layer is not optional for a feature customers pay for. It is the difference between a provider outage being a non-event and being a Monday-morning postmortem. The same numbers-based definition of safe applies after launch, which is why we pair this with the monitoring described in our post on what to measure after an AI feature ships.

Frequently asked questions

How many times should I retry a failed LLM call?

For a user-facing request, two retries with exponential backoff and jitter is a sane default, capped by a total time budget of a few seconds. Background jobs can afford more attempts over a longer window. Always honor the provider's Retry-After and rate-limit headers rather than relying only on your backoff curve, and never retry non-transient errors like a 400.

Do I really need a multi-provider fallback?

If the feature is critical to a customer workflow, yes. Same-provider fallbacks (a cheaper or faster model) handle single-model issues, but only a cross-provider fallback survives a full provider outage. If the feature is low-stakes and a brief failure is acceptable, a single provider with good retries may be enough. Match the resilience to the cost of downtime.

What triggers should open a circuit breaker for an LLM feature?

Start with the standard triggers: a count of consecutive failures, an elevated error rate, or p95 latency past a threshold. For LLM features, add cost per request exceeding a ceiling and excessive conversation turns in agentic flows. These two catch runaway-loop and runaway-cost bugs that failure-based and latency-based breakers miss, because a looping agent can keep getting fast, successful responses.
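The two LLM-specific triggers can be sketched as a per-request guard that an agent loop updates after every model call. The names RunGuard and BudgetExceeded are hypothetical, and the ceilings and per-token prices are placeholders; use your provider's actual pricing.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a request crosses its cost or turn ceiling."""


@dataclass
class RunGuard:
    max_cost_usd: float = 0.50  # assumed cost ceiling for one request
    max_turns: int = 20         # assumed ceiling on agent turns
    cost_usd: float = 0.0
    turns: int = 0

    def record_turn(
        self,
        input_tokens: int,
        output_tokens: int,
        usd_per_1k_input: float,
        usd_per_1k_output: float,
    ) -> None:
        """Add one model call to the running totals and stop the run if over."""
        self.turns += 1
        self.cost_usd += (
            input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output
        )
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost {self.cost_usd:.4f} USD over ceiling")
        if self.turns > self.max_turns:
            raise BudgetExceeded(f"{self.turns} turns over ceiling")
```

Create one guard per user request and call record_turn after every model response. A prompt loop then stops at the ceiling instead of running overnight.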

Should I build resilience in code or use an LLM gateway?

For your first AI feature, application code is simpler and gives you full control over retry, fallback, and breaker logic. Once you run two or more features against multiple providers, a gateway that centralizes fallback routing, retries, and budgets removes duplicated logic and is usually worth adopting.

Most teams discover they need this layer the hard way, during an outage, after the feature is already live. If you are adding an AI feature to a product customers depend on and want the resilience built in from day one, that is exactly the kind of work we ship in days, not quarters. See what we build or look at how we ship features safely in shipping an AI feature without breaking production.
