Shipping a streaming AI feature: SSE, partial output, and aborts
A non-streaming AI feature makes the user stare at a spinner for eight seconds and then dumps a wall of text. The same model call, streamed token by token, feels fast because the user sees output within a few hundred milliseconds. Streaming is the single highest-leverage UX change for most LLM features, and it is also where a lot of edge cases hide. This is what it takes to ship streaming properly inside a SaaS product, not just in a demo.
Pick a transport: SSE by default
Server-Sent Events (SSE) is the default for LLM streaming. It is a one-directional stream of text events over a normal HTTP connection, it is what the major model APIs already emit, and it survives the proxies and load balancers that mangle WebSocket upgrades. For a chat or generate feature where the server talks and the client listens, SSE is the least code and the fewest failure modes.
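To make the wire format concrete, here is a minimal sketch of how a server might serialize one event. The `SseEvent` shape and the `formatSseEvent` name are illustrative assumptions, not part of any library. The `id`, `event`, and `data` fields and the blank-line terminator come from the event-stream format itself.

```typescript
// Hypothetical shape for one outgoing event; not a library type.
interface SseEvent {
  data: string;
  event?: string; // optional event name, e.g. "token" or "done"
  id?: string;    // optional id, echoed back by the browser on reconnect
}

// Serialize one event in text/event-stream format.
// Multi-line data becomes one "data:" line per line of text,
// and a blank line terminates the event.
function formatSseEvent({ data, event, id }: SseEvent): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  return lines.join("\n") + "\n\n";
}
```

The response that carries these events is sent with the text/event-stream content type, and each formatted string is written to the open connection as soon as the token is available.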
Reach for WebSockets only when you genuinely need bidirectional, low-latency traffic: live tool-call progress that the user can interrupt mid-step, multi-user collaboration on the same session, or a voice interface. The cost is that you now own connection state, heartbeats, and reconnection logic that SSE mostly handles for you.
The latency number that matters
Total generation time barely moves with streaming; perceived speed moves a lot. The metric to instrument is time to first token, the gap between sending the request and the first visible character. A first token that arrives within roughly 300 to 700 ms reads as snappy. We go deeper on this in our note on LLM inference latency and time to first token; for the feature work, the rule is simple: optimize first paint, then total throughput.
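One way to instrument this on the client is to wrap the token stream and record when the first non-empty token arrives. The sketch below assumes the stream is exposed as an async iterable of strings. `measureStream` and `StreamTiming` are hypothetical names, and the clock is injectable only so the helper can be tested.

```typescript
// Hypothetical result shape for illustration.
interface StreamTiming {
  text: string;
  timeToFirstTokenMs: number | null; // null if the stream produced nothing
  totalMs: number;
}

// Consume a token stream, forwarding tokens to onToken while recording
// time to first token and total time. The clock starts when this is called,
// so call it at the moment the request is issued.
async function measureStream(
  tokens: AsyncIterable<string>,
  onToken: (token: string) => void = () => {},
  now: () => number = () => Date.now(),
): Promise<StreamTiming> {
  const start = now();
  let firstAt: number | null = null;
  let text = "";
  for await (const token of tokens) {
    if (firstAt === null && token.length > 0) firstAt = now();
    text += token;
    onToken(token);
  }
  return {
    text,
    timeToFirstTokenMs: firstAt === null ? null : firstAt - start,
    totalMs: now() - start,
  };
}
```

Reporting both numbers side by side makes it obvious when a change improves first paint without changing total generation time.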
Render partial output without breaking the page
Streaming text is not clean text. The model emits half-finished words, an open code fence with no close, a markdown table that is three rows in, or a JSON object missing its final brace. If you pipe raw deltas straight into a markdown renderer, the UI flickers and occasionally throws.
The fix is to buffer and render defensively:
- Accumulate tokens into a running string and re-render from that string, rather than appending raw fragments to the DOM.
- Detect an unclosed code fence or inline span and close it temporarily for display, so a half-open block does not swallow the rest of the page.
- If you stream structured output such as JSON, parse incrementally with a tolerant parser and only act on a field once it is complete.
- Throttle re-renders to animation frames so a fast stream does not pin the main thread.
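The first two points in the list above can be sketched as a small buffer that always hands the renderer a display-safe copy of the full text so far. This is a simplified illustration, not a complete markdown repair: it only handles triple-backtick code fences, and `StreamBuffer` and `closeOpenFence` are hypothetical names.

```typescript
// Three backticks, built at runtime so this sample stays readable inside a fence.
const FENCE = "`".repeat(3);

// Return a display-safe copy of partially streamed markdown.
// An odd number of fence lines means a code block is still open,
// so a closing fence is appended for display only.
function closeOpenFence(partial: string): string {
  const fenceCount = partial
    .split("\n")
    .filter((line) => line.trim().startsWith(FENCE)).length;
  if (fenceCount % 2 === 0) return partial;
  return partial.endsWith("\n") ? partial + FENCE : partial + "\n" + FENCE;
}

// Accumulates deltas into one running string; the renderer re-renders
// from displayText() instead of appending fragments to the DOM.
class StreamBuffer {
  private text = "";

  append(delta: string): void {
    this.text += delta;
  }

  // Full text so far with any open code fence temporarily closed.
  displayText(): string {
    return closeOpenFence(this.text);
  }

  // Unmodified text, e.g. for copying or saving the final answer.
  rawText(): string {
    return this.text;
  }
}
```

In a browser, you would typically call displayText() from a requestAnimationFrame callback rather than on every delta, which covers the throttling point as well.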
Handle the cases demos skip: abort, errors, reconnect
A real feature has to deal with users and networks that do not cooperate.
Abort
Give the user a visible stop button and wire it to actually cancel the request: an AbortController on the client plus propagation to the model call on the server, so you stop paying for tokens nobody will read. A stop that only hides the UI but keeps generating is a quiet cost leak.
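A minimal sketch of that wiring, assuming the model output is exposed as an async iterable and that one AbortSignal is shared by the consumer and the producer. `fakeModelStream` is a hypothetical stand-in for a real model call, and `runWithStop` is an illustrative helper, not a library function.

```typescript
// Hypothetical stand-in for a model call: yields tokens until the signal aborts.
// A real SDK or HTTP call would be handed the same signal.
async function* fakeModelStream(
  tokens: string[],
  signal: AbortSignal,
): AsyncGenerator<string> {
  for (const token of tokens) {
    if (signal.aborted) return; // stop generating tokens nobody will read
    yield token;
  }
}

// Consume a stream into onToken and report whether it finished or was stopped.
async function runWithStop(
  stream: AsyncIterable<string>,
  signal: AbortSignal,
  onToken: (token: string) => void,
): Promise<"completed" | "stopped"> {
  for await (const token of stream) {
    if (signal.aborted) return "stopped";
    onToken(token);
  }
  return signal.aborted ? "stopped" : "completed";
}
```

In the browser, the stop button calls controller.abort() on the AbortController whose signal was passed to fetch in its signal option. On the server, the closed request should in turn abort the upstream model call, which is the propagation step that stops the spend.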
Errors mid-stream
A stream can fail after it has already painted half an answer. Decide the behavior on purpose: keep the partial text and show an inline error, or roll back, but never leave a truncated answer that looks complete. Send a terminal event from the server (a final done or error message) so the client can tell a finished stream from a dropped one.
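On the client, the terminal event lets you classify how a stream ended. The sketch below assumes three event types named token, done, and error. Those names and the `Outcome` shape are illustrative choices, not a standard.

```typescript
// Hypothetical event vocabulary for illustration.
type StreamEvent =
  | { type: "token"; text: string }
  | { type: "done" }
  | { type: "error"; message: string };

type Outcome =
  | { status: "complete"; text: string }
  | { status: "failed"; partial: string; message: string }
  | { status: "dropped"; partial: string };

// Fold a sequence of events into an outcome. A stream that ends, or throws,
// before a terminal "done" or "error" event is treated as dropped.
async function collectOutcome(events: AsyncIterable<StreamEvent>): Promise<Outcome> {
  let text = "";
  try {
    for await (const ev of events) {
      if (ev.type === "token") text += ev.text;
      else if (ev.type === "done") return { status: "complete", text };
      else return { status: "failed", partial: text, message: ev.message };
    }
  } catch {
    // Transport-level failure before any terminal event arrived.
    return { status: "dropped", partial: text };
  }
  return { status: "dropped", partial: text };
}
```

The UI can then branch on status: show the text as final only for complete, and for failed or dropped either keep the partial text with an inline error or roll it back, whichever behavior you chose.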
Reconnect and resume
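Mobile networks drop. If the connection breaks mid-generation, the client should reconnect and either resume from the last received offset or restart cleanly, not silently stall on a half-answer. SSE's built-in Last-Event-ID mechanism helps here if you assign ids to events: the browser's EventSource reconnects automatically and sends the last id it saw in a Last-Event-ID request header, so the server can tell where to pick up.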
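A minimal sketch of resume logic on both sides, assuming the server keeps a log of the events it has generated for the session and assigns increasing integer ids starting at 1. `StoredEvent`, `eventsToReplay`, and `ResumableText` are hypothetical names for illustration.

```typescript
// Hypothetical stored event: an increasing integer id plus the text delta.
interface StoredEvent {
  id: number;
  data: string;
}

// Server side: given the events generated so far and the Last-Event-ID value
// from a reconnecting client, return the events the client has not seen.
// A missing or unparseable id falls back to a clean restart (replay everything).
function eventsToReplay(log: StoredEvent[], lastEventId: string | null): StoredEvent[] {
  if (lastEventId === null) return log;
  const last = Number.parseInt(lastEventId, 10);
  if (Number.isNaN(last)) return log;
  return log.filter((e) => e.id > last);
}

// Client side: apply events idempotently so a replayed event
// is never rendered twice.
class ResumableText {
  private lastId = 0;
  private text = "";

  apply(e: StoredEvent): void {
    if (e.id <= this.lastId) return; // duplicate from a replay
    this.text += e.data;
    this.lastId = e.id;
  }

  get lastEventId(): string {
    return String(this.lastId);
  }

  get value(): string {
    return this.text;
  }
}
```

With EventSource the browser tracks and sends the last id for you. A fetch-based client would need to track it and send it on reconnect itself, which is what lastEventId is for here.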
Treat streaming as part of the feature spec, not a polish step
Streaming touches the model call, the API layer, and the front end at once, so it is easy to under-scope. Fold the streaming behavior into the acceptance criteria for the feature: first paint under your time-to-first-token target, a working stop button, defined error and reconnect behavior, and correct rendering of partial markdown and code. We treat this kind of definition of done as part of writing AI feature acceptance criteria, and it is the same discipline that keeps AI features from breaking production when they ship.
Frequently asked questions
SSE or WebSockets for streaming LLM responses?
Use SSE for one-way token streaming, which covers most chat and generation features. It needs less code and survives proxies better. Choose WebSockets only when you need bidirectional realtime traffic such as interruptible tool progress or multi-party collaboration.
Does streaming actually make the model faster?
No. Total generation time is about the same. Streaming improves perceived speed because the user starts reading as soon as the first token arrives instead of waiting for the whole response, so the feature feels faster even when total latency is identical.
How do I render markdown that is still streaming?
Buffer tokens into one string, re-render from that string on each frame, and temporarily close any open code fence or span so a half-written block does not break layout. Parse streamed JSON incrementally and act on a field only once it is complete.
What happens if the connection drops mid-stream?
Send a terminal event when a stream completes so the client can distinguish done from dropped, assign event ids so the client can reconnect and resume, and never leave a truncated answer presented as finished.
Streaming is a small change in API surface and a large change in how an AI feature feels. If you want senior engineers to build the end-to-end streaming path for your product, see what we build.