Fix a slow LLM feature: cut time to first token
A user clicks the button, then stares at a spinner for three seconds before any text appears. The model is fine. The answer is good. The feature still feels broken, because the part a person experiences as speed is not how long the whole response takes; it is how long until something starts happening. That gap has a name, and optimizing for it is usually cheaper and faster than people expect.
This is the practitioner version: what time to first token actually measures, why it dominates perceived speed, and the levers that move it in production.
Time to first token is the metric that matters
Two latency numbers describe an LLM response. Time to first token (TTFT) is how long from sending the request until the first token comes back. Inter-token latency is the gap between consecutive tokens after that; its inverse, tokens per second, is how fast the rest streams. Total latency is roughly TTFT plus the time to stream every output token.
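To make the two numbers concrete, here is a minimal sketch that splits one streamed response into TTFT, mean inter-token latency, and total latency. It assumes you already record a timestamp when the request is sent and one per token received; the `LatencyBreakdown` and `latency_breakdown` names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LatencyBreakdown:
    ttft_s: float          # request sent -> first token received
    inter_token_s: float   # mean gap between consecutive tokens
    total_s: float         # request sent -> last token received


def latency_breakdown(request_sent_s: float, token_times_s: list[float]) -> LatencyBreakdown:
    """Split one streamed response into TTFT, mean inter-token latency, and total."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_sent_s
    total = token_times_s[-1] - request_sent_s
    gaps = len(token_times_s) - 1
    # With a single token there is no gap to average, so report 0.0.
    inter_token = (token_times_s[-1] - token_times_s[0]) / gaps if gaps else 0.0
    return LatencyBreakdown(ttft, inter_token, total)
```

Total latency then works out to TTFT plus the number of gaps times the mean inter-token latency, which is the "TTFT plus streaming time" relationship described above.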
For anything a person waits on, TTFT is what they feel. A response that starts in 400ms and streams over four seconds feels fast, because the user is reading from the half-second mark. A response that takes two seconds to start and then dumps the full answer at once feels slow, even when its total time is the same or shorter. Optimizing total latency while ignoring TTFT is optimizing the wrong number.
As of 2026, the fastest mainstream APIs for TTFT are models like Gemini 2.5 Flash and Claude Haiku 4.5, both consistently under 600ms on medium-length prompts. That is your floor for an interactive feature: if your measured TTFT is several seconds, the model is rarely the cause. The cause is almost always something in your own pipeline.
Where your latency actually goes
Before blaming the model, account for everything that happens before the model is even called. In a RAG feature, a single user request often triggers an embedding call, a vector search, a re-rank, and only then the generation call. Each step adds latency, and they usually run in sequence when some could run in parallel.
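As a rough illustration of sequence versus parallel, the sketch below runs two independent pre-generation steps first one after another and then concurrently with `asyncio.gather`. The step functions (`embed_query`, `load_chat_history`) and their 50ms sleeps are hypothetical stand-ins for real network calls; which steps in your pipeline are truly independent is something only your own dependency graph can tell you.

```python
import asyncio
import time


async def embed_query(query: str) -> list[float]:
    await asyncio.sleep(0.05)  # stand-in for an embedding API call
    return [float(len(query))]


async def load_chat_history(user_id: str) -> list[str]:
    await asyncio.sleep(0.05)  # stand-in for a database read
    return [f"history for {user_id}"]


async def prepare_sequential(query: str, user_id: str):
    # Each await blocks the next step, so the latencies add up.
    embedding = await embed_query(query)
    history = await load_chat_history(user_id)
    return embedding, history


async def prepare_concurrent(query: str, user_id: str):
    # Neither step needs the other's result, so run both at once.
    embedding, history = await asyncio.gather(
        embed_query(query), load_chat_history(user_id)
    )
    return embedding, history


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```

Both versions return the same data; the concurrent one should take roughly as long as the slowest step rather than the sum of both.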
Streaming from the model, then re-buffering on the server
The most common self-inflicted wound: the backend streams tokens from the model, then buffers them and sends the whole response to the browser at once. You paid for streaming and threw away its only benefit. Stream end to end, from the model through your server to the client, so the first token reaches the user the moment it exists.
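The difference is easy to see in a toy version. The sketch below uses a fake token generator in place of a real model client (an assumption, as are all the function names) and compares a handler that joins the whole stream before sending with one that forwards each token as it arrives.

```python
import time
from typing import Iterable, Iterator


def fake_model_stream(tokens: Iterable[str], delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a model client that yields tokens as they are generated."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def buffered_handler(model_stream: Iterable[str]) -> Iterator[str]:
    """Anti-pattern: drain the whole model stream, then send one big chunk."""
    yield "".join(model_stream)


def streaming_handler(model_stream: Iterable[str]) -> Iterator[str]:
    """Pass each token through to the client as soon as it exists."""
    for token in model_stream:
        yield token


def time_to_first_chunk(chunks: Iterable[str]) -> float:
    """Seconds until the handler produces its first chunk."""
    start = time.perf_counter()
    next(iter(chunks))
    return time.perf_counter() - start
```

Both handlers deliver the same text. The buffered one makes the client wait for the entire generation before it sees anything, while the streaming one hands over the first token after a single token delay. In a real stack, also check that proxies, compression middleware, and the HTTP framework are not buffering the response on your behalf.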
Doing retrieval on the critical path when you do not have to
If the same retrieval runs on every keystroke or every follow-up question, it adds its full cost to every TTFT. Cache stable retrievals, debounce input, and run independent steps concurrently instead of one after another. Often half the perceived latency is retrieval work that did not need to block the first token.
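One way to keep repeated retrievals off the critical path is a small time-to-live cache in front of the retriever. This is a sketch under assumptions: the `RetrievalCache` class, the five-minute default TTL, and the whitespace and case normalization are all illustrative choices, and the cache is only safe where results for a given query really are stable for that long.

```python
import time
from typing import Callable


class RetrievalCache:
    """Cache retrieval results per normalized query for a fixed time window."""

    def __init__(
        self,
        fetch: Callable[[str], list[str]],
        ttl_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch    # the real (slow) retrieval function
        self._ttl_s = ttl_s
        self._clock = clock    # injectable so tests can control time
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def get(self, query: str) -> list[str]:
        key = " ".join(query.lower().split())  # normalize case and whitespace
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]                      # fresh entry: skip retrieval
        docs = self._fetch(query)              # miss or expired: pay full cost
        self._entries[key] = (now, docs)
        return docs
```

A follow-up question that normalizes to the same query within the TTL returns immediately instead of repeating the embedding and search work.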
Using a reasoning model for a job that does not need one
Reasoning-optimized models think before they answer, which is the opposite of fast first output. They are the wrong tool for real-time streaming. Reserve them for tasks where accuracy justifies the wait, and route interactive surfaces to a low-TTFT model.
The levers that move it
Once the pipeline is clean, a handful of changes do most of the work.
Model choice is the biggest single lever. Routing the easy, latency-sensitive requests to a fast small model (and only the genuinely hard ones to a larger model) cuts both TTFT and cost. This is the same model-routing pattern that controls spend; we cover the cost side in our breakdown of where AI infrastructure costs actually come from.
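A router can start out very simple. The sketch below is one possible shape, not a recommendation of specific thresholds: the model identifiers are placeholders, the `flagged_hard` field assumes some upstream signal such as a classifier or a product rule, and prompt length is only a crude proxy for difficulty.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your provider offers.
FAST_MODEL = "fast-small-model"
LARGE_MODEL = "large-model"


@dataclass
class Request:
    prompt: str
    flagged_hard: bool = False  # e.g. set by an upstream classifier (assumed)


def choose_model(request: Request, max_easy_prompt_chars: int = 4000) -> str:
    """Send hard requests to the large model and everything else to the fast one."""
    is_hard = request.flagged_hard or len(request.prompt) > max_easy_prompt_chars
    return LARGE_MODEL if is_hard else FAST_MODEL
```

Defaulting to the fast model and escalating only on an explicit signal keeps the interactive path quick; the rule for what counts as hard is the part worth tuning against your own traffic.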
Caching helps latency, not just cost. A semantic cache that returns a stored answer for a repeated or near-identical question responds in milliseconds instead of seconds; reported reductions reach around 73 percent on high-repetition workloads. Prompt caching also lowers TTFT because the model skips recomputing the cached prefix.
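The core of a semantic cache is a similarity lookup before the model call. The sketch below shows the mechanism only: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and the 0.8 threshold is an arbitrary assumption you would tune so that near-duplicates hit and merely related questions do not.

```python
import math
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real embedding model (assumption)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self._entries: list[tuple[Counter, str]] = []

    def store(self, question: str, answer: str) -> None:
        self._entries.append((toy_embed(question), answer))

    def lookup(self, question: str) -> Optional[str]:
        """Return a stored answer if a similar enough question was seen before."""
        query = toy_embed(question)
        best = max(self._entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None  # miss: fall through to the model call
```

On a hit the feature skips generation entirely, which is where the milliseconds-instead-of-seconds behavior comes from. A threshold set too low will serve wrong answers quickly, so it deserves the same evaluation as any other quality-affecting change.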
If you self-host, hardware-level techniques compound. Quantization that makes a model roughly half the size has delivered about 2.2x higher throughput and a 30 percent better TTFT in published results, and FP8 key-value cache quantization halves cache memory so you can double concurrency without a meaningful quality loss. These matter most when you are serving your own weights at volume rather than calling a hosted API.
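The key-value cache claim is straightforward arithmetic. The sketch below uses the standard per-token estimate (keys plus values, per layer, per KV head) with a hypothetical model shape and memory budget; the layer count, head count, head dimension, context length, and 16 GiB figure are all assumptions chosen to keep the numbers round.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_sequences(budget_bytes: int, context_len: int, per_token_bytes: int) -> int:
    """How many full-length sequences fit in the KV cache budget."""
    return budget_bytes // (context_len * per_token_bytes)


# Hypothetical model shape and budget, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT_LEN = 4096
BUDGET_BYTES = 16 * 1024**3  # 16 GiB reserved for KV cache

fp16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
fp8_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=1)

fp16_sequences = max_sequences(BUDGET_BYTES, CONTEXT_LEN, fp16_per_token)
fp8_sequences = max_sequences(BUDGET_BYTES, CONTEXT_LEN, fp8_per_token)
```

With these assumed numbers, halving the bytes per cached value halves the per-token footprint and doubles the number of full-length sequences that fit in the same memory. Whether quality holds up is an empirical question for your model and workload.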
The order to work in: fix the pipeline first (stream end to end, parallelize, cache retrieval), then pick a low-TTFT model for interactive paths, then reach for hardware tricks only if you self-host. Most teams get the win they need from the first two and never touch the third.
Measure before you optimize
You cannot tune what you do not record. Log TTFT and total latency as separate metrics for every model call, tag each measurement with the step that produced it, and report both at p50 and p95. The p95 is where users churn; an average that looks fine often hides a slow tail that is doing the damage. Latency that creeps up over time is also a maintenance signal, one of several we track in our look at the real cost of maintaining AI products.
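A minimal version of that reporting, assuming each logged sample is a dict with hypothetical `step`, `ttft_ms`, and `total_ms` fields, might look like the sketch below. It uses the nearest-rank percentile definition; a metrics backend may interpolate differently, so treat the exact values as illustrative.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: list[dict]) -> dict:
    """Group samples by pipeline step and report p50/p95 for TTFT and total latency."""
    by_step = defaultdict(lambda: {"ttft_ms": [], "total_ms": []})
    for sample in samples:
        by_step[sample["step"]]["ttft_ms"].append(sample["ttft_ms"])
        by_step[sample["step"]]["total_ms"].append(sample["total_ms"])
    return {
        step: {
            f"{metric}_p{pct}": percentile(values, pct)
            for metric, values in metrics.items()
            for pct in (50, 95)
        }
        for step, metrics in by_step.items()
    }
```

Keeping TTFT and total latency as separate series per step is what lets you see, for example, that generation is healthy at p50 while retrieval has a slow p95 tail.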
Done in order, latency work is some of the highest-return effort on a shipped AI feature: no model downgrade, no quality loss, just a faster first token. It is a standard part of how we ship and tune the production AI features we build, and if you need hands on it directly, our senior LLM engineers do this kind of profiling routinely.
Frequently asked questions
What is a good time to first token for an interactive feature?
Aim for under 600ms, which the fastest mainstream models hit on medium prompts. Under one second feels responsive; past two seconds users notice the wait. If your measured TTFT is several seconds, the bottleneck is almost always your pipeline, not the model.
Does streaming reduce actual latency or just perceived latency?
Mostly perceived, and that is the point. Streaming does not make the full response finish sooner, but it puts the first words in front of the user far earlier, so the feature feels fast. The common bug is streaming from the model and then re-buffering on the server, which discards the benefit.
Should I use a reasoning model for a chat feature?
Usually not on the interactive path. Reasoning models deliberate before answering, which raises TTFT. Route real-time chat to a fast model and reserve reasoning models for tasks where the extra accuracy is worth a slower start, ideally off the critical path.
What is the fastest latency win I can ship this week?
Stream end to end and confirm nothing in your stack re-buffers the response, then route latency-sensitive requests to a low-TTFT model. Those two changes need no model quality tradeoff and typically cut perceived latency more than any single infrastructure change.
Would you rather we just build it?
Book a free scoping call and we'll ship your production-safe AI feature this week.