What your AI feature does when the model fails
Your AI feature works in the demo. Then a provider has a bad afternoon: the API returns 500s, then 429s, then answers that arrive 12 seconds late. The question every SaaS team should answer before launch is not whether the model is smart enough. It is what the feature does the moment the model stops behaving.
Most AI features have no answer. They call the model, await a response, and render it. When the call fails, the user gets a spinner that never resolves or a raw error string. A planned failure path is the difference between a feature that survives a provider incident and one that becomes a support queue.
- Failover swaps the provider on transient network and 5xx errors. Fallback swaps the model, the prompt, or the whole behavior on semantic failures like rate limits, guardrail blocks, or slow responses.
- You need both, plus a plan for what the feature shows when neither recovers the answer in time.
- Tell users when they are getting a degraded result. Silent downgrades erode trust faster than an honest error.
Failover and fallback are not the same thing
People use the words interchangeably and then build only half the system. Keep them separate because they trigger on different signals and need different code paths.
Failover is about reachability. A connection times out, DNS flaps, or the provider returns a 503. The fix is to retry the same request against a different endpoint or a mirrored provider. The output you expect is identical, so the switch can be invisible to the user.
Fallback is about the response being unusable even though the provider is reachable and answering. The provider returns a 429 rate limit, a content filter blocks the output, the context window overflows, or the answer is materially worse than usual. Retrying the same call harder does not help. You need a cheaper model, a shorter prompt, a cached answer, or a non-AI path.
A single retry-with-backoff loop cannot cover both, because the correct action for a 503 (retry the same thing elsewhere) is the wrong action for a 429 (back off and switch to something lighter). Branch on the failure class first, then act.
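As a rough illustration of branching on the failure class first, here is a minimal sketch. The `Action` enum, the `classify_failure` function, and its parameters are hypothetical names, and the mapping of status codes to actions is an assumption based on the split described above, not any provider's contract.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    FAILOVER = "failover"  # same request, different endpoint or provider
    FALLBACK = "fallback"  # lighter model, shorter prompt, cache, or non-AI path
    SURFACE = "surface"    # neither helps (for example a 401); show an honest error


def classify_failure(status: Optional[int], timed_out: bool = False,
                     blocked_by_filter: bool = False) -> Action:
    """Map a failed call to the kind of recovery it needs.

    Only call this for calls that already failed; a clean 200 is out of scope.
    """
    if timed_out or status is None:
        # No response at all: treat it as a reachability problem.
        return Action.FAILOVER
    if 500 <= status <= 599:
        return Action.FAILOVER
    if status == 429 or blocked_by_filter:
        return Action.FALLBACK
    return Action.SURFACE
```

The retry loop then becomes a dispatcher: it asks for the action first and only retries elsewhere when the answer is `Action.FAILOVER`.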
The failure modes you actually have to handle
Hard errors and timeouts
5xx responses and dropped connections are the easy case. Retry with exponential backoff and jitter, cap the total wait against your latency budget, and if you run more than one provider, send the retry to the backup. Set the timeout deliberately: a 30 second default means a stalled provider holds your request thread for 30 seconds per user, during exactly the kind of incident when you have the most users retrying.
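A minimal sketch of that retry loop, assuming each provider is wrapped in a zero-argument callable that raises a hypothetical `TransientError` on 5xx responses and dropped connections. The 4 second budget and 0.2 second base delay are placeholder values, and the per-call timeout is assumed to live inside each callable.

```python
import random
import time
from typing import Callable, Sequence


class TransientError(Exception):
    """Hypothetical marker for 5xx responses and dropped connections."""


def call_with_backoff(providers: Sequence[Callable[[], str]],
                      budget_s: float = 4.0,
                      base_delay_s: float = 0.2,
                      max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> str:
    """Retry across providers until one answers or the latency budget runs out."""
    if not providers:
        raise ValueError("at least one provider is required")
    deadline = clock() + budget_s
    last_error: Exception = TransientError("no attempts made")
    for attempt in range(max_attempts):
        # Rotate providers so each retry goes to the next one in the list.
        provider = providers[attempt % len(providers)]
        try:
            return provider()
        except TransientError as exc:
            last_error = exc
        # Full jitter: wait a random amount up to the exponential ceiling.
        delay = random.uniform(0, base_delay_s * (2 ** attempt))
        if clock() + delay >= deadline:
            break  # another attempt would exceed the budget
        sleep(delay)
    raise last_error
```

When the loop gives up, the caller still holds the last error and can hand the request to the next recovery step instead of leaving the user waiting.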
Rate limits
A 429 is not a failure to retry blindly. Honor the provider's Retry-After header, then either queue the request or drop to a model with separate capacity. During a spike, the worst thing you can do is retry every rejected request immediately, which turns one rate limit into a self-inflicted denial of service. Rate-limit your own outbound calls with a token bucket so you shed load before the provider does it for you.
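Both halves of that advice fit in a few lines. The sketch below is one possible shape, with hypothetical names (`TokenBucket`, `retry_after_seconds`). It assumes the headers arrive as a plain mapping keyed exactly `Retry-After`, and it only handles the delay-in-seconds form of the header, not the HTTP-date form.

```python
import time
from typing import Callable, Mapping


class TokenBucket:
    """Allow about `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed or queue this call instead of sending it


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read a Retry-After value given in seconds; use a default otherwise."""
    value = headers.get("Retry-After", "")
    try:
        return max(0.0, float(value))
    except ValueError:
        return default
```

A caller would check `try_acquire()` before every outbound request, and on a 429 wait for `retry_after_seconds(...)` or drop to a lighter model rather than retrying at once.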
Quality regressions and guardrail blocks
The output arrives, parses, and is wrong. A model update shifts behavior, or a safety filter refuses a legitimate request. These never show up as HTTP errors, so you only catch them if you validate the output. A lightweight check on the parsed result, schema validation for structured output, or a quick eval on a sample of live responses gives you the signal. Treat repeated validation failures as a trigger to fall back, not just to log.
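To make "validate, then fall back on repeated failures" concrete, here is a small sketch. The expected shape (a JSON object with a non-empty `summary` string), the window of 20 responses, and the threshold of 5 failures are all assumptions chosen for illustration. A real check would match your own output contract.

```python
import json
from collections import deque
from typing import Deque


def is_valid_summary(raw: str) -> bool:
    """Hypothetical check: JSON object with a non-empty 'summary' string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (isinstance(data, dict)
            and isinstance(data.get("summary"), str)
            and data["summary"].strip() != "")


class ValidationMonitor:
    """Signal a fallback when too many recent outputs fail validation."""

    def __init__(self, window: int = 20, max_failures: int = 5) -> None:
        # Only the most recent `window` results are kept.
        self.results: Deque[bool] = deque(maxlen=window)
        self.max_failures = max_failures

    def record(self, raw: str) -> bool:
        """Record one output; return True if the caller should fall back."""
        self.results.append(is_valid_summary(raw))
        return self.results.count(False) >= self.max_failures
```

Because the window slides, the monitor recovers on its own once valid outputs start arriving again.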
Partial streaming failures
If you stream tokens to the UI, the connection can die mid-answer. Now the user has half a sentence and no way to know it is incomplete. Track completion server-side, mark truncated responses, and give the user a retry that resumes or regenerates rather than leaving a dangling fragment. We covered the streaming mechanics in streaming LLM responses as a SaaS feature; the failure path is the part teams skip.
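One way to track completion server-side is to wrap the stream and record whether it reached an explicit end marker. In this sketch the `[DONE]` marker, the `StreamResult` type, and the `emit` callback are assumptions. Substitute whatever end-of-stream signal your provider or transport actually sends.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class StreamResult:
    text: str
    complete: bool  # False means the stream ended before the end marker


def consume_stream(chunks: Iterable[str],
                   emit: Callable[[str], None],
                   done_marker: str = "[DONE]") -> StreamResult:
    """Forward chunks to the UI and record whether the stream finished cleanly."""
    parts: List[str] = []
    try:
        for chunk in chunks:
            if chunk == done_marker:
                return StreamResult("".join(parts), complete=True)
            parts.append(chunk)
            emit(chunk)
    except ConnectionError:
        pass  # connection dropped mid-answer; report it as truncated below
    # Reached on a dropped connection or a stream that ended without the marker.
    return StreamResult("".join(parts), complete=False)
```

When `complete` is false, the UI can mark the answer as truncated and offer a retry instead of presenting the fragment as finished.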
A degradation ladder that keeps the feature usable
Design the feature as a ladder of decreasing capability rather than a binary works-or-broken. Each rung should still return something a user can act on.
Rung one is the primary model on the primary provider. Rung two is the same request on a backup provider or a comparable model, used for failover and hard rate limits. Rung three is a cheaper or smaller model with a tightened prompt, which trades some quality for capacity when the good models are saturated. Rung four is a cached or precomputed answer for common queries, which costs almost nothing to serve and does not depend on any provider being up. Rung five is the honest non-AI path: show the raw search results, the last known summary, or a form that lets the user do the task manually.
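The ladder can be sketched as an ordered list of handlers that are tried until one returns something usable. All names here (`LadderResult`, `Rung`, `run_ladder`) are hypothetical, and the handlers are assumed to return `None` or raise when they cannot answer.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class LadderResult:
    answer: str
    rung: str       # name of the rung that produced the answer
    degraded: bool  # True when the user should be told about the downgrade


# Each rung is (name, handler, degraded). A handler returns None or raises to pass.
Rung = Tuple[str, Callable[[str], Optional[str]], bool]


def run_ladder(query: str, rungs: Sequence[Rung]) -> LadderResult:
    """Try each rung in order and return the first usable answer."""
    for name, handler, degraded in rungs:
        try:
            answer = handler(query)
        except Exception:
            continue  # a real system would branch on the failure class here
        if answer is not None:
            return LadderResult(answer, name, degraded)
    raise RuntimeError("every rung failed, including the non-AI path")
```

Carrying the rung name and a `degraded` flag on the result keeps the decision about what to tell the user out of the handlers themselves.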
Picking the right rung per request is a routing decision, and the same infrastructure that saves money by routing easy queries to cheap models also gives you the fallback path for free. If you have not built that layer, model routing to cut costs is the natural place to add it, because a router already knows how to send a request to a different model on demand.
Tell the user when the AI is degraded
The instinct is to hide failures so the product looks flawless. Do the opposite when the fallback changes what the user is getting. If rung two returns an answer from a different model of similar quality, staying silent is fine. If rung five drops the user from an AI answer to a keyword search, say so, because a user who thinks they got the AI result and acts on a worse one blames you later.
A small, plain status line does the job: "This answer used a backup system and may be less detailed." That one sentence converts a silent quality drop into an informed choice, and it stops the support tickets that start with "the feature is broken" when the feature was actually degraded on purpose.
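In code, the status line can be a lookup keyed by whichever rung produced the answer. The rung names and every notice except the first are assumptions made up for this sketch. Rungs of comparable quality are simply left out of the table so they stay silent.

```python
from typing import Dict, Optional

# Hypothetical rung names. The primary model and a comparable backup are
# deliberately absent, so they produce no notice.
NOTICES: Dict[str, str] = {
    "small_model": "This answer used a backup system and may be less detailed.",
    "cache": "This answer was prepared earlier and may be out of date.",
    "manual": "AI answers are unavailable right now. Showing search results instead.",
}


def status_line(rung: str) -> Optional[str]:
    """Return the notice to show for a rung, or None when silence is fine."""
    return NOTICES.get(rung)
```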
What to instrument
You cannot manage a fallback system you cannot see. Emit a metric for every rung: how often each fires, the latency of each, and the cost per hop. An AI gateway should detect provider degradation within roughly 5 to 15 seconds of the first bad signal, before the next wave of user requests hits the unhealthy upstream. A common circuit-breaker starting point from production deployments is 5 consecutive failures to trip open, a 60 second cooldown before a test request, an alert at a 5 percent error rate, and a page at 15 percent.
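A minimal breaker using the starting thresholds above (5 consecutive failures, 60 second cooldown) might look like the sketch below. The class and method names are hypothetical, and it is deliberately simplified: after the cooldown it lets requests through until the next recorded failure, so under concurrency more than one test request can slip past.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after N consecutive failures; allow a test request after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once the cooldown has passed, let a test request through.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or restart the cooldown
```

The open and close transitions are natural places to emit the metrics and alerts described above.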
Watch the cost side too. An agent that retries a failed tool call can multiply its token spend in a loop, so meter tokens per provider and per fallback hop. Route these signals into the same place you already watch model quality; if you are standing that up, observability versus evals for AI agents draws the line between what to monitor live and what to catch in testing. And because a fallback is a production behavior with real edge cases, treat it like any other launch surface: our notes on shipping AI features without breaking production cover the deploy-time guardrails that keep a fallback path from becoming its own incident.
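Metering tokens per provider and per hop can start as a pair of counters. The `TokenMeter` name, the rung labels, and the per-request cap of 20,000 tokens are assumptions for this sketch. The cap is one possible guard against a retry loop multiplying spend, not a recommended value.

```python
from collections import Counter


class TokenMeter:
    """Count tokens per (provider, rung) and flag requests that exceed a cap."""

    def __init__(self, per_request_cap: int = 20_000) -> None:
        self.per_request_cap = per_request_cap
        self.totals: Counter = Counter()      # keyed by (provider, rung)
        self.by_request: Counter = Counter()  # keyed by request id

    def record(self, request_id: str, provider: str, rung: str, tokens: int) -> bool:
        """Add usage; return False once this request has exceeded its cap."""
        self.totals[(provider, rung)] += tokens
        self.by_request[request_id] += tokens
        return self.by_request[request_id] <= self.per_request_cap
```

A `False` return is the signal to stop retrying and move to a cheaper rung.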
Frequently asked questions
Do I need a second LLM provider to have a fallback?
No. A second provider gives you the strongest failover for outages, but a fallback ladder can run entirely on one provider using a cheaper model, a cached answer, and a non-AI path. Start with what you have and add a backup provider when a single-vendor outage is a real risk to your service level agreement.
How do I decide the timeout for a model call?
Work backward from the latency your feature promises the user. If the interaction should feel responsive within a few seconds, a 30 second model timeout is far too generous; it just holds resources during an incident. Set a tight timeout, and let the fallback ladder catch the requests that exceed it rather than making every user wait for the slow path.
Should the fallback answer look identical to the primary one?
Only if its quality is genuinely comparable. A backup model of similar strength can be invisible. A meaningful downgrade, such as dropping from a reasoning model to a keyword search, should be labeled so the user knows to verify before acting on it.
Is a circuit breaker overkill for a small app?
A full breaker library is optional, but the idea is not. Even a simple counter that stops calling a provider after several fast failures and retries it a minute later will keep one bad upstream from stalling every request in your app. Add the thresholds and dashboards as your traffic grows.
Would you rather we just build it?
Book a free scoping call and we'll ship your production-safe AI feature this week.