Active observability for LLM apps
Active observability turns production LLM traces into continuous signals for quality, cost, latency, prompts, tools, and evals before users report a problem.
TL;DR: active observability is what happens when LLM observability stops being a place you check after an incident and becomes a system that continuously surfaces what needs attention. Production traces are the raw material: prompts, completions, tool calls, retrieval steps, token usage, latency, cost, users, sessions, and prompt versions all stay connected so teams can find quality problems before users report them.
Passive observability answers "what happened when I went looking?" Active observability answers "what changed, what is risky, and what should we inspect next?"
Why LLM observability has to become active
Traditional observability works well when the failure mode is a service going down, an endpoint getting slow, or an exception rate climbing. Those signals are still useful in AI products, but they miss the failures that matter most to users.
An LLM endpoint can return 200 OK while the answer is wrong, unsupported, expensive, slow, unsafe, or based on the wrong retrieval context. An agent can finish successfully after calling the same tool six times. A prompt version can improve one task and quietly regress another. A RAG pipeline can keep latency flat while retrieval quality drifts because the underlying documents changed.
The hard part is volume. A team can manually inspect ten traces. It cannot manually inspect every chat turn, support answer, agent step, retrieval span, prompt experiment, and judge score flowing through production. At production scale, observability has to do more than preserve evidence. It has to help decide which evidence deserves attention.
That is the shift from storing traces to active observability.
What active observability means
Active observability is the practice of continuously turning production behavior into useful signals. For LLM apps, that means trace data is not just archived for debugging later. It is grouped, filtered, scored, compared, and reviewed so the team can see patterns that would otherwise hide in the long tail.
Useful active signals look like this:
- A prompt version has lower eval scores than the previous version on real traffic.
- A model change lowered latency but increased hallucination risk on grounded answers.
- A subset of sessions contains repeated tool errors even though the final request status is successful.
- A RAG feature has stable cost but declining retrieval relevance.
- A small group of users or tenants is driving most high-token traces.
- An agent run often hits the step limit after a specific tool returns empty output.
The point is not to let automation make every product decision. The point is to make the next human review sharper. Instead of opening random traces, the team starts from the traces that show a pattern, outlier, regression, or recurring failure mode.
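As a rough illustration of one signal from the list above (a small group of users driving most high-token traces), here is a minimal sketch. It assumes traces have already been exported as plain dictionaries; the field names `trace_id`, `user_id`, and `total_tokens` and the function `top_token_consumers` are hypothetical and not part of any Currai API.

```python
from collections import Counter


def top_token_consumers(traces, share=0.5):
    """Return the smallest set of users, largest first, whose combined
    token usage reaches `share` of all tokens seen in `traces`."""
    totals = Counter()
    for trace in traces:
        totals[trace["user_id"]] += trace["total_tokens"]

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    selected, running = [], 0
    for user_id, tokens in totals.most_common():
        selected.append((user_id, tokens))
        running += tokens
        if running / grand_total >= share:
            break
    return selected


# Hypothetical exported traces; the numbers are made up for illustration.
traces = [
    {"trace_id": "t1", "user_id": "acme", "total_tokens": 9000},
    {"trace_id": "t2", "user_id": "acme", "total_tokens": 7000},
    {"trace_id": "t3", "user_id": "globex", "total_tokens": 1500},
    {"trace_id": "t4", "user_id": "initech", "total_tokens": 2500},
]
heavy_users = top_token_consumers(traces, share=0.75)
```

The output is a short list of users to review, not a decision. A reviewer would still open the underlying traces to see whether the usage is legitimate or a symptom of a loop or an oversized prompt.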
The trace is the raw material
Active observability only works if the underlying trace is complete enough to explain the behavior. A thin log line with a request ID and response text is not enough. You need the full path of the LLM request.
In Currai, a trace is one logical operation: a chat turn, a RAG answer, an agent run, or another product action. Inside the trace, each model call is a generation, and non-model work such as retrieval, tool calls, routing, MCP actions, parsers, and guardrails is recorded as spans.
That shape matters because the failure usually lives between steps. If an answer is wrong, the question is not just "what did the model say?" It is:
- What prompt and prompt version produced the answer?
- What retrieval query ran, and which documents came back?
- Which tool calls happened, with what input and output?
- Which model was called, with what parameters?
- How many tokens were used, and where did latency accumulate?
- Which user, session, tenant, environment, or experiment arm did this belong to?
- Did nearby turns in the same session set up the failure?
When those fields stay connected, the trace becomes more than a debugging record. It becomes a unit of evidence that can be searched, rolled up, evaluated, and compared.
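To make the trace shape described above concrete, here is a minimal data-structure sketch. It is not Currai's schema: the class and field names (`Trace`, `Generation`, `Span`, `kind`, `error`, and so on) are assumptions chosen to mirror the questions in the list, and the model name is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Generation:
    """One model call inside a trace."""
    name: str
    model: str
    prompt_name: str
    prompt_version: int
    input_tokens: int
    output_tokens: int
    latency_ms: float


@dataclass
class Span:
    """Non-model work: retrieval, a tool call, a parser, a guardrail."""
    name: str
    kind: str  # e.g. "retrieval", "tool", "guardrail"
    input: str
    output: str
    latency_ms: float
    error: Optional[str] = None


@dataclass
class Trace:
    """One logical operation, with the identifiers needed to group it later."""
    trace_id: str
    user_id: str
    session_id: str
    environment: str
    generations: list = field(default_factory=list)
    spans: list = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(g.input_tokens + g.output_tokens for g in self.generations)

    def failed_spans(self) -> list:
        return [s for s in self.spans if s.error is not None]


# A hypothetical support-assistant turn.
trace = Trace(
    trace_id="t-1",
    user_id="u-42",
    session_id="s-7",
    environment="production",
    generations=[
        Generation("answer", "example-model", "support-answer", 3, 1200, 180, 950.0),
    ],
    spans=[
        Span("search-docs", "retrieval", "refund policy", "doc-12, doc-31", 140.0),
        Span("lookup-order", "tool", "order 991", "", 3000.0, error="timeout"),
    ],
)
```

Because the tool error and the prompt version live in the same record, a question like "which prompt versions ran while this tool was timing out?" becomes a filter over traces rather than a join across separate logs.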
From traces to evals
The most direct path from observability to active observability runs through evals.
Static eval sets are useful, but they go stale. Users change how they ask questions, documents change, tools change, prompts change, and models change. Production traces keep the dataset close to the behavior your product actually served.
The active loop looks like this:
- Capture production traces with prompt names, prompt versions, user IDs, session IDs, tags, model details, token usage, latency, and outputs.
- Select recent traces for a prompt, feature, agent, or high-risk workflow.
- Score the outputs with a narrow rubric such as groundedness, format adherence, policy compliance, task completion, or answer helpfulness.
- Compare the scores by prompt version, model, retrieval strategy, experiment arm, user segment, or time window.
- Open the failing traces to understand whether the prompt, retrieval, tool output, model choice, or product flow caused the problem.
- Turn repeated failures into future regression cases.
That last step is where the workflow becomes durable. A bad trace should not only explain yesterday's incident. It should become a check that catches the same failure mode before the next prompt or model change ships.
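The compare and regression steps of the loop above can be sketched in a few lines. This assumes each trace has already been scored between 0 and 1 by some rubric (a judge model or a human reviewer); the dictionary fields, the function names, and the scores below are illustrative assumptions, not real results.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(scored_traces):
    """Group eval scores by prompt version and average each group."""
    by_version = defaultdict(list)
    for trace in scored_traces:
        by_version[trace["prompt_version"]].append(trace["score"])
    return {version: mean(scores) for version, scores in sorted(by_version.items())}


def regression_cases(scored_traces, rubric, threshold=0.5):
    """Turn traces that scored below `threshold` into future regression cases."""
    return [
        {
            "source_trace_id": trace["trace_id"],
            "input": trace["input"],
            "rubric": rubric,
            "observed_score": trace["score"],
        }
        for trace in scored_traces
        if trace["score"] < threshold
    ]


# Hypothetical groundedness scores for two versions of one prompt.
scored = [
    {"trace_id": "t1", "prompt_version": 3, "input": "Can I get a refund?", "score": 1.0},
    {"trace_id": "t2", "prompt_version": 3, "input": "Where is my order?", "score": 0.5},
    {"trace_id": "t3", "prompt_version": 4, "input": "Can I get a refund?", "score": 0.25},
    {"trace_id": "t4", "prompt_version": 4, "input": "Change my address", "score": 0.75},
    {"trace_id": "t5", "prompt_version": 4, "input": "Cancel my plan", "score": 0.5},
]
averages = mean_score_by_version(scored)
cases = regression_cases(scored, rubric="groundedness")
```

In this made-up data, version 4 averages lower than version 3, and the one trace below the threshold becomes a regression case that keeps a pointer back to its source trace. With only a handful of traces per version, a gap like this is a prompt to review, not a conclusion.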
What to monitor continuously
Not every metric deserves a dashboard. Active observability works best when the signals map to decisions the team is willing to make.
For most LLM apps, start with these:
- Quality regressions: eval score drops by prompt version, model, feature, or traffic segment.
- Hallucination and grounding failures: answers unsupported by retrieved documents, tool output, or known source data.
- Retrieval misses: low relevance, empty results, stale chunks, or context that does not match the user question.
- Tool failures: repeated errors, empty outputs, invalid arguments, or tools selected for the wrong task.
- Agent loops: repeated actions, rising step counts, step-limit exits, or growing context that drives token cost up.
- Latency outliers: slow generations, slow tools, slow retrieval, retries, or multi-step workflows where one span dominates the request.
- Cost spikes: high-token traces, expensive model routing, repeated calls, long sessions, or prompt versions that added unnecessary context.
- Session-level failures: individual turns that look acceptable but become poor user experiences when replayed as a conversation.
The important detail is that each signal should point back to the trace. A chart that says quality dropped is useful. A chart that lets you open the exact prompts, responses, retrieval spans, tools, and sessions behind the drop is much more useful.
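As one example of a signal that points back to the trace, the sketch below flags latency outliers and names the step that dominates each one. It assumes steps run sequentially, so that step latencies sum to the request latency, and it uses hypothetical field names (`steps`, `latency_ms`) and a hypothetical latency budget.

```python
def dominant_step(steps, threshold=0.5):
    """Return (name, share) for the slowest step if it accounts for at
    least `threshold` of total latency, otherwise None."""
    total = sum(step["latency_ms"] for step in steps)
    if total <= 0:
        return None
    slowest = max(steps, key=lambda step: step["latency_ms"])
    share = slowest["latency_ms"] / total
    return (slowest["name"], share) if share >= threshold else None


def latency_outliers(traces, budget_ms, threshold=0.5):
    """Flag traces over the latency budget, keeping the trace ID so a
    reviewer can open the exact request behind the number."""
    flagged = []
    for trace in traces:
        total = sum(step["latency_ms"] for step in trace["steps"])
        if total > budget_ms:
            flagged.append({
                "trace_id": trace["trace_id"],
                "total_ms": total,
                "dominant_step": dominant_step(trace["steps"], threshold),
            })
    return flagged


# Hypothetical traces with per-step latency in milliseconds.
traces = [
    {"trace_id": "t1", "steps": [
        {"name": "search-docs", "latency_ms": 200},
        {"name": "answer", "latency_ms": 800},
    ]},
    {"trace_id": "t2", "steps": [
        {"name": "search-docs", "latency_ms": 300},
        {"name": "crm-lookup", "latency_ms": 4000},
        {"name": "answer", "latency_ms": 700},
    ]},
]
flagged = latency_outliers(traces, budget_ms=3000)
```

Here only `t2` exceeds the budget, and the result says where to look first: the `crm-lookup` tool call, not the model.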
How Currai supports the active loop
Currai is built around the data active observability needs: production traces with prompts, completions, generations, spans, tools, retrieval, tokens, cost, latency, sessions, users, tags, metadata, prompt versions, and eval results in one workflow.
You can send traces through the Currai SDKs, use Langfuse-compatible instrumentation, or export OpenTelemetry spans over OTLP. The ingestion path matters less than the shape of the data: one request becomes a trace, model calls become generations, and the surrounding work becomes spans.
Once the traces exist, Currai can help teams move through the operating loop:
- Inspect a single production trace when a user reports a bad answer.
- Group traces by user or session to replay multi-turn behavior.
- Roll up token usage and cost by model, user, trace, and time window.
- Attach prompt names and versions so prompt changes are measurable.
- Run evals on real outputs instead of relying only on static test cases.
- Compare prompt variants with production A/B tests.
- Trace agents, tools, retrieval, MCP calls, and nested workflows without losing the request tree.
The goal is not observability for its own sake. The goal is a shorter path from "something changed" to "we know why, and we know what to fix."
Start with one active signal
The first version of active observability does not need a giant taxonomy. Pick one production path where quality matters and make the trace complete.
For a support assistant, that might mean tracing the user question, retrieved documents, final answer, prompt version, token usage, latency, user ID, and session ID. Then choose one signal: groundedness failures, high-cost sessions, or answers that violate the required format. Review the outliers every week and turn repeated failures into evals.
For an agent, start with step count, tool calls, tool outputs, final answer, cost, latency, and whether the run hit its limit. The first active signal might be "runs that call the same tool repeatedly" or "runs where a tool returns empty output but the model continues."
The pattern is the same in both cases: capture the real run, find the recurring failure mode, score or tag it, fix the system, and keep the trace around as evidence for the next change.
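The two agent signals mentioned above, repeated calls to the same tool and runs that continue after a tool returns empty output, could be tagged with a sketch like the following. The step format (`type`, `tool`, `input`, `output`) and the `max_repeats` default are assumptions made for illustration.

```python
from collections import Counter


def repeated_tool_calls(steps, max_repeats=3):
    """Return {(tool, input): count} for tool calls repeated with the
    same input more than `max_repeats` times in one agent run."""
    counts = Counter(
        (step["tool"], step["input"]) for step in steps if step["type"] == "tool"
    )
    return {key: count for key, count in counts.items() if count > max_repeats}


def continued_after_empty_output(steps):
    """True if a tool step returned empty output and the run kept going."""
    for index, step in enumerate(steps):
        is_last = index == len(steps) - 1
        if step["type"] == "tool" and not step["output"].strip() and not is_last:
            return True
    return False


# A hypothetical agent run: the same lookup fails four times, then the
# model answers anyway.
empty_lookup = {"type": "tool", "tool": "search_orders", "input": "order 991", "output": ""}
run = [dict(empty_lookup) for _ in range(4)]
run.append({"type": "generation", "output": "Your order has shipped."})

loops = repeated_tool_calls(run)
ignored_empty = continued_after_empty_output(run)
```

Both checks return tags, not verdicts. A run that retries a flaky tool twice and then succeeds may be fine, which is why the flagged traces still go to a human review.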
FAQs
What is active observability?
Active observability is the practice of continuously turning production traces into signals that surface regressions, outliers, quality problems, cost spikes, and recurring failure modes. For LLM apps, that means using prompts, completions, tool calls, retrieval spans, eval scores, latency, token usage, and metadata to decide what needs review next.
How is active observability different from LLM observability?
LLM observability captures what happened inside an LLM app. Active observability uses that captured data to find patterns and drive action: compare prompt versions, run evals on real outputs, detect expensive traces, inspect failed agent loops, and turn repeated failures into regression checks.
What data do you need for active observability?
At minimum, capture the prompt input, model output, model name, token usage, latency, user or session identifiers, and errors. For stronger active signals, also capture retrieval spans, tool calls, prompt names and versions, experiment metadata, tags, environment, cost, and eval scores.
