Apr 10, 2026

Measure latency and time-to-first-token

Total latency hides the metric users actually feel — time to first token. Here's how to capture both on every generation and find what's making your LLM app feel slow.

TUTORIAL | 5 min read | The Currai team / Engineering


For a streaming LLM app, total response time is the wrong number to optimize. A user doesn't feel the last token; they feel the wait before the first one. Two responses can take the same total time, and the one that starts streaming sooner will feel noticeably faster. To optimize what users feel, you have to measure time-to-first-token (TTFT), not just total latency.

Capture both timings

The generation already records total duration. For streaming calls, mark the moment the first token arrives so TTFT becomes its own number.

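import time

# `trace`, `client`, and `messages` are assumed to be set up elsewhere.
gen = trace.generation(name="answer", model="gpt-4o", input=messages)
start = time.perf_counter()
first_token_at = None
parts = []

for chunk in client.stream(messages):
    if first_token_at is None:
        # First chunk received: the moment the user starts seeing output.
        first_token_at = time.perf_counter()
    parts.append(chunk)  # assumes each chunk is a text fragment

full_text = "".join(parts)
# Leave TTFT unset if the stream produced no chunks at all.
ttft_ms = None if first_token_at is None else int((first_token_at - start) * 1000)

gen.end(
    output=full_text,
    metadata={"ttft_ms": ttft_ms},
)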

Now every generation carries both the total it took and the moment it started talking.

Read the percentiles, not the average

Latency is a distribution, and the average lies. A p50 of 800 ms with a p95 of 6 seconds means most requests are fine and a meaningful slice is painful, while the average splits the difference into a number that describes neither. Always look at p50, p95, and p99 together.
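
To make the point concrete, here is a minimal sketch that computes p50, p95, and p99 from a list of latency samples using only Python's standard library. The helper name latency_percentiles and the sample values are illustrative assumptions, not part of any tracing SDK; in practice you would feed it the ttft_ms or total-duration values exported from your traces.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 for a list of latency samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical sample: 90 fast requests and 10 slow ones.
samples = [800] * 90 + [6000] * 10
summary = latency_percentiles(samples)
mean_ms = statistics.fmean(samples)

print(summary)
print(mean_ms)
```

With this made-up sample the p50 is 800 ms and the p95 is 6000 ms, while the mean comes out at 1320 ms, a latency that no request in the sample actually had.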

Where the time usually goes

When TTFT is high, the culprit is almost always before the model starts generating:

Fat prompts — the model has to read a giant context before it can emit a token. A nested trace shows the prompt size next to the TTFT.
Slow retrieval — if you build the prompt from a vector search, a slow search delays the whole generation. Span the retrieval separately to see it.
Cold starts — the first request to a scaled-to-zero worker pays a startup tax. Filter those out or warm the pool.
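
One way to see which of these is responsible is to time each phase on its own. The sketch below uses a small context manager built on time.perf_counter; the names timed and slowest_phase, and the two sleep-based stand-in functions, are hypothetical and only illustrate the idea. If your tracing SDK has its own span API, use that instead so the timings land in the trace.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings, name):
    """Record the wall-clock duration of a block, in milliseconds, under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

def slowest_phase(timings):
    """Return the name of the phase that took the longest."""
    return max(timings, key=timings.get)

# Stand-ins for real work; replace with your vector search and model call.
def fake_retrieval():
    time.sleep(0.05)

def fake_first_token():
    time.sleep(0.01)

timings = {}
with timed(timings, "retrieval"):
    fake_retrieval()
with timed(timings, "first_token"):
    fake_first_token()

print(slowest_phase(timings))
```

Recording the phases side by side shows whether a high TTFT comes from the model itself or from work done before the request was ever sent.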

Tie latency to cost and quality

The fastest model isn't free, and the cheapest isn't always fast enough. Because your traces carry latency, cost, and quality scores together, you can make the trade-off with data: drop to a smaller model on the latency-sensitive path, keep the larger one where quality matters, and watch all three numbers move. That's the difference between guessing at "fast enough" and knowing it.
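
As a rough illustration of making that trade-off with data, the sketch below picks the cheapest model that stays within a latency budget and above a quality floor. The ModelStats fields, the pick_model helper, the model names, and every number are placeholders invented for this example, not measurements; substitute the aggregates from your own traces.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelStats:
    name: str
    p95_ttft_ms: float
    cost_per_1k_requests: float
    quality_score: float  # 0.0 to 1.0, from your own evals

def pick_model(candidates, max_p95_ttft_ms, min_quality):
    """Return the cheapest candidate within the latency budget and above
    the quality floor, or None if no candidate qualifies."""
    eligible = [
        m for m in candidates
        if m.p95_ttft_ms <= max_p95_ttft_ms and m.quality_score >= min_quality
    ]
    return min(eligible, key=lambda m: m.cost_per_1k_requests, default=None)

# Placeholder numbers for illustration only, not measurements.
candidates = [
    ModelStats("large-model", p95_ttft_ms=2400, cost_per_1k_requests=12.0, quality_score=0.92),
    ModelStats("small-model", p95_ttft_ms=600, cost_per_1k_requests=1.5, quality_score=0.84),
]

# Latency-sensitive path: tight TTFT budget, moderate quality floor.
chat_path = pick_model(candidates, max_p95_ttft_ms=1000, min_quality=0.80)
# Quality-sensitive path: relaxed TTFT budget, high quality floor.
report_path = pick_model(candidates, max_p95_ttft_ms=5000, min_quality=0.90)

print(chat_path.name, report_path.name)
```

With these placeholder values the latency-sensitive path selects the smaller model and the quality-sensitive path keeps the larger one, which is the split the paragraph above describes.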
