Measure latency and time-to-first-token
The Currai team, Engineering — Apr 10, 2026
For a streaming LLM app, total response time is the wrong number to optimize. A user doesn't feel the last token; they feel the wait before the first one. Two responses can take the same total time, and the one that starts streaming sooner will feel much faster. To optimize what users feel, you have to measure time-to-first-token (TTFT), not just total latency.
Capture both timings
The generation already records total duration. For streaming calls, mark the moment the first token arrives so TTFT becomes its own number.
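As a rough sketch of the idea, independent of any particular SDK: the snippet below wraps a token iterator in plain Python and records when the first token arrives. The names `time_stream` and `StreamTiming` are hypothetical and are not part of the Currai API, and the injectable `clock` parameter is an assumption added only to make the function easy to test.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    text: str
    ttft_s: Optional[float]  # None if the stream produced no tokens
    total_s: float


def time_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream, recording TTFT and total duration.

    Hypothetical helper: `tokens` stands in for whatever streaming
    iterator your model client returns.
    """
    start = clock()
    first_token_at = None
    parts = []
    for token in tokens:
        if first_token_at is None:
            first_token_at = clock()  # mark the first token's arrival
        parts.append(token)
    end = clock()
    ttft = None if first_token_at is None else first_token_at - start
    return StreamTiming("".join(parts), ttft, end - start)
```

For the TTFT to reflect what the user waits for, the clock has to start before the request is sent. With a lazy iterator that only issues the request on the first read, calling `time_stream` directly on it satisfies that; with an eager client you would capture the start time before making the call.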
Now every generation carries both its total duration and its time-to-first-token.
Read the percentiles, not the average
Latency is a distribution, and the average lies. A p50 of 800 ms with a p95 of 6 seconds means most requests are fine and a meaningful slice is painful, and the average splits the difference into a number that describes neither. Always look at p50, p95, and p99 together.
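To make that concrete, here is a minimal sketch that computes the mean next to p50, p95, and p99 using the nearest-rank definition of a percentile. The function names are hypothetical, and nearest-rank is only one of several common percentile definitions; your tracing tool may interpolate instead.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at
    least p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize a list of latency samples given in milliseconds."""
    return {
        "mean": mean(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }
```

With 90 samples at 800 ms and 10 samples at 6000 ms, `latency_summary` reports a p50 of 800, a p95 and p99 of 6000, and a mean of 1320: a latency that no request in the sample actually had.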
Where the time usually goes
When TTFT is high, the cause is almost always something that happens before the model starts generating:
- Fat prompts — the model has to read a giant context before it can emit a token. A nested trace shows the prompt size next to the TTFT.
- Slow retrieval — if you build the prompt from a vector search, a slow search delays the whole generation. Span the retrieval separately to see it.
- Cold starts — the first request to a scaled-to-zero worker pays a startup tax. Filter those out or warm the pool.
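One way to see which of these stages dominates is to time each one separately. The sketch below is a plain-Python stand-in for spans, not the Currai SDK: `StageTimer`, `span`, and the stage names are all hypothetical, and a real tracing library would attach these durations to the trace instead of a dictionary.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.durations = {}

    @contextmanager
    def span(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        """Name of the stage with the largest accumulated duration."""
        return max(self.durations, key=self.durations.get)
```

Usage would look like wrapping the vector search in `with timer.span("retrieval"):` and the prompt assembly in `with timer.span("prompt_build"):`, then comparing `timer.durations` against the measured TTFT to see how much of the wait happened before the model was even called.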
Tie latency to cost and quality
The fastest model isn't free, and the cheapest isn't always fast enough. Because your traces carry latency, cost, and quality scores together, you can make the trade-off with data: drop to a smaller model on the latency-sensitive path, keep the larger one where quality matters, and watch all three numbers move. That's the difference between guessing at "fast enough" and knowing it.
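A minimal sketch of that decision, assuming you have already aggregated per-model numbers from your traces: pick the cheapest model that meets both a TTFT budget and a quality floor. The `ModelStats` fields, the `pick_model` function, and the thresholds are hypothetical; the real values would come from your own data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelStats:
    name: str
    p95_ttft_ms: float           # aggregated from traces
    cost_per_1k_requests: float  # in whatever currency you track
    quality: float               # evaluation score, assumed to be in [0, 1]


def pick_model(
    candidates: Sequence[ModelStats],
    max_p95_ttft_ms: float,
    min_quality: float,
) -> Optional[ModelStats]:
    """Cheapest candidate within the latency budget and above the
    quality floor, or None if no candidate qualifies."""
    eligible = [
        m for m in candidates
        if m.p95_ttft_ms <= max_p95_ttft_ms and m.quality >= min_quality
    ]
    return min(eligible, key=lambda m: m.cost_per_1k_requests, default=None)
```

Calling it once per path, with a tight TTFT budget on the latency-sensitive path and a higher quality floor elsewhere, makes the trade-off explicit. A `None` result is itself useful: it says no current model satisfies both constraints.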