All posts
GUIDE6 min read

What is LLM observability?

The Currai team, EngineeringMay 27, 2026

Traditional observability answers three questions about a service: is it up, is it fast, and is it throwing errors. Logs, metrics, and traces were built for code whose behavior is deterministic — the same input produces the same output, and a stack trace tells you exactly where things went wrong.

LLM apps break that assumption. The same prompt can produce different completions on every call. A response can be slow, expensive, and wrong all at once while your HTTP layer happily reports 200 OK. LLM observability is the practice of capturing the data that explains what the model actually did — the prompt you sent, the completion you got back, the tokens it cost, and every retrieval and tool call in between — so a non-deterministic system becomes something you can debug.

What you can't see without it

When an LLM feature misbehaves, the symptoms live in places your APM never looks:

  • The prompt that was assembled at runtime, after templating, retrieval, and history were stitched together.
  • The completion the model returned, including the tool calls it chose.
  • The token usage that turned a cheap feature into a budget problem.
  • The latency of each step, so you know whether the model or your retriever was slow.

A screenshot of a bad answer tells you nothing. The trace behind it tells you everything.

The unit of LLM observability is the trace

A trace is one logical unit of work — answering a question, running an agent turn, completing a chat. Inside it, every model call is a generation and every retrieval or function call is a span. Capturing one is a few lines:

from currai import Currai

currai = Currai(public_key="pk-lf-...", secret_key="sk-lf-...")

trace = currai.trace(name="support-answer", user_id="user-1")
gen = trace.generation(name="openai.chat", model="gpt-4o-mini", input=messages)
gen.end(output=reply, usage={"input": 312, "output": 88})

That single trace is replayable: you can open it later and see exactly what the model saw and said.

Observability is not evaluation — but it feeds it

Observability tells you what happened. Evaluation tells you whether it was good. They are different jobs, but they share the same data: the traces you capture in production become the dataset you score offline and the baseline you compare new prompts against. Capture first; you can always grade later.

Where to start

You don't need an evals strategy or a dashboard plan on day one. You need one trace flowing. Wrap your hottest LLM call, ship it, and watch real prompts and completions land. Once the data is there, cost roll-ups, latency percentiles, and quality scores are each one extra argument away — and you're no longer guessing about a system you can finally see.