How to evaluate stateful AI agents with production traces
Stateful agent evals need more than a final answer score. Currai ties agent steps, tool calls, sessions, cost, latency, and eval results back to the production trace.
TL;DR: stateful agent evals need the whole run, not just the final answer. Stateful AI agents accumulate context, call tools, read live data, retry steps, and make decisions across multi-turn conversations. To evaluate them well, you need production traces that preserve every generation, tool call, retrieval span, session, token, cost, latency, and output that shaped the result.
A final-answer score can tell you an agent failed. The trace tells you why it failed and which part of the system needs to change.
Why stateful agent evals are harder than prompt evals
A prompt eval usually tests one model call. You give the model an input, score the output, and compare versions. That is useful, but it is too small for a stateful agent.
An agent is a loop. It receives a user request, chooses an action, calls a tool, reads the result, updates its working state, and decides what to do next. A single request can include many model calls, tools, retrieval steps, retries, guardrails, and handoffs. The final answer is only the last artifact in a longer chain of decisions.
That makes agent evaluation more complicated:
- State accumulates across steps, so a bad early tool result can poison a good final prompt.
- Tools change the world, so the agent may need to be judged on whether it should have acted at all, not only on what it said afterward.
- Live systems fail, so tool errors, empty responses, rate limits, and stale data are part of the evaluation.
- Loops hide cost, because the cheapest model call can become expensive when the agent repeats it ten times.
- Multi-turn conversations drift, so one acceptable answer may still belong to a session where the agent lost the user's goal.
If you only score the final message, all of those failure modes collapse into "bad answer." That is not enough signal to fix an agent.
The trace is the eval unit
For stateful agent evals, the trace should be the unit of review. One trace is one agent run or one turn in the product workflow. Inside that trace, each model call is a generation and each non-model step is a span.
In Currai, that means the eval dataset can include:
- The user input, session ID, user ID, tags, environment, and metadata.
- Every generation with model name, prompt input, completion, parameters, usage, latency, and cost.
- Tool spans with tool names, arguments, outputs, errors, and timing.
- Retrieval spans with the query, selected documents, reranking behavior, and assembled context.
- Prompt names, prompt versions, and experiment metadata.
- Final trace output, status, latency, total tokens, and rolled-up cost.
That structure matters because agent bugs are usually relational. The model chose the wrong tool because retrieval returned irrelevant context. The final answer was unsupported because a tool returned an empty object and the model guessed. The run was expensive because the agent replayed a growing state into every generation.
When the trace keeps those steps together, an eval is no longer judging an isolated string. It is judging a run.
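To make the structure above concrete, here is a minimal sketch of a trace that holds generations and spans together. The class and field names (`Generation`, `Span`, `Trace`, `empty_tool_outputs`) are illustrative assumptions, not the Currai schema or SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Generation:
    """One model call inside a run (hypothetical shape, not the Currai schema)."""
    model: str
    prompt_input: str
    completion: str
    total_tokens: int
    latency_ms: float
    cost_usd: float


@dataclass
class Span:
    """One non-model step such as a tool call or a retrieval."""
    name: str
    kind: str  # "tool" or "retrieval" in this sketch
    arguments: dict
    output: Any = None
    error: Optional[str] = None


@dataclass
class Trace:
    """One agent run with its steps kept in execution order."""
    trace_id: str
    session_id: str
    user_input: str
    steps: list = field(default_factory=list)  # Generation and Span objects
    final_output: str = ""

    def generations(self):
        return [s for s in self.steps if isinstance(s, Generation)]

    def total_cost_usd(self):
        # Roll cost up from the generations to the run.
        return sum(g.cost_usd for g in self.generations())

    def total_tokens(self):
        return sum(g.total_tokens for g in self.generations())

    def empty_tool_outputs(self):
        # Tool spans that returned nothing usable, a common cause of guessed answers.
        return [s for s in self.steps
                if isinstance(s, Span) and s.kind == "tool" and not s.output]


# Example run: the tool returns an empty object and the model answers anyway.
trace = Trace("t1", "s1", "Where is my order?")
trace.steps.append(Generation("model-a", "Where is my order?", "call lookup_order", 135, 400.0, 0.002))
trace.steps.append(Span("lookup_order", "tool", {"order_id": "A1"}, output={}))
trace.steps.append(Generation("model-a", "tool returned {}", "It ships tomorrow.", 180, 350.0, 0.003))
trace.final_output = "It ships tomorrow."
```

Because the empty tool span sits in the same object as the final answer, a reviewer or an automated check can connect the unsupported claim to the step that caused it.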
What to score in a stateful agent
Good agent evaluation uses more than one score. Start with the outcome, then add step-level checks for the failures that actually matter in your product.
Useful stateful agent eval dimensions include:
- Final answer quality: Did the answer solve the user's task, follow the required format, and avoid hallucinated claims?
- Tool choice: Did the agent select the right tool for the user's intent?
- Tool arguments: Were the arguments valid, complete, safe, and grounded in the user request?
- Tool output faithfulness: Did the final answer accurately reflect the tool or retrieval output?
- Loop behavior: Did the agent repeat actions, hit a step limit, or keep working after enough evidence existed?
- Handoff behavior: Did the agent escalate, ask a clarifying question, or stop when automation was not appropriate?
- Retrieval use: Did the agent retrieve relevant context and use it instead of inventing unsupported details?
- Latency: Did the user wait because one model call was slow, or because the whole loop took too many steps?
- Cost per resolved task: How many tokens and dollars did the successful run require?
Do not start with twenty rubrics. Pick two or three that match the agent's job. A support agent might care about answer correctness, tool faithfulness, and handoff quality. A coding agent might care about task completion, command safety, and whether the final patch actually matches the request.
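As one example of a narrow step-level check, the loop-behavior dimension could be scored roughly as follows. This is a sketch under assumptions: the tool-call dictionaries, the `step_limit` default, and the pass/fail scoring are hypothetical choices, not a prescribed rubric.

```python
import json
from collections import Counter


def repeated_tool_calls(tool_calls, max_repeats=1):
    """Return calls issued more than max_repeats times with identical arguments."""
    counts = Counter(
        # Serialize arguments with sorted keys so equal dicts compare equal.
        (call["name"], json.dumps(call["arguments"], sort_keys=True))
        for call in tool_calls
    )
    return {key: n for key, n in counts.items() if n > max_repeats}


def loop_score(tool_calls, step_limit=10):
    """1.0 when no call repeats and the run stays under the step limit, else 0.0."""
    if len(tool_calls) > step_limit:
        return 0.0
    return 0.0 if repeated_tool_calls(tool_calls) else 1.0


calls = [
    {"name": "search", "arguments": {"q": "refund"}},
    {"name": "search", "arguments": {"q": "refund"}},  # identical repeat
    {"name": "lookup_order", "arguments": {"id": "A1"}},
]
```

A binary score like this is deliberately blunt. It is cheap to run over many traces and points the reviewer at runs worth opening, which is usually enough for a first rubric.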
How production traces become an agent eval dataset
Synthetic tests are useful for controlled cases, but production traces show the distribution your users actually create. They include vague requests, long sessions, missing context, unexpected tool outputs, expensive loops, and edge cases no one wrote into a spreadsheet.
A practical workflow looks like this:
- Instrument the agent run as a trace.
- Record each model decision as a generation.
- Wrap tools, retrieval, MCP calls, API requests, and custom work as spans.
- Attach user ID, session ID, environment, tags, prompt version, and experiment metadata.
- Let production traces accumulate for the feature you want to evaluate.
- Filter traces by prompt, agent name, tag, model, cost, latency, error state, user segment, or session.
- Score a representative slice with narrow rubrics.
- Open failed traces and identify the step that caused the failure.
- Turn repeated failures into regression cases for future prompt, model, tool, or routing changes.
The key is that the eval starts from the real run. You do not need to reconstruct what happened from logs or guess which prompt version served the request. The trace already contains the input, output, steps, metadata, usage, and cost.
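The filter, inspect, and regression steps of that workflow can be sketched in a few functions. The trace dictionaries and helper names below (`select_slice`, `first_failing_step`, `to_regression_case`) are hypothetical and stand in for whatever export format your tracing tool provides.

```python
def select_slice(traces, *, tag, min_cost_usd=0.0):
    """Filter traces by tag and a minimum run cost."""
    return [t for t in traces if tag in t["tags"] and t["cost_usd"] >= min_cost_usd]


def first_failing_step(trace):
    """Return (index, step) for the first step that recorded an error, or None."""
    for index, step in enumerate(trace["steps"]):
        if step.get("error"):
            return index, step
    return None


def to_regression_case(trace):
    """Keep only what a future test needs: the input, the version, and the culprit."""
    failing = first_failing_step(trace)
    return {
        "input": trace["input"],
        "prompt_version": trace["prompt_version"],
        "failing_step": failing[1]["name"] if failing else None,
        "source_trace_id": trace["id"],
    }


traces = [
    {"id": "t1", "tags": ["support"], "cost_usd": 0.04, "prompt_version": "v3",
     "input": "Cancel my plan",
     "steps": [{"name": "plan_lookup", "error": None},
               {"name": "cancel_plan", "error": "rate_limited"}]},
    {"id": "t2", "tags": ["support"], "cost_usd": 0.01, "prompt_version": "v3",
     "input": "What is my balance?",
     "steps": [{"name": "balance_lookup", "error": None}]},
    {"id": "t3", "tags": ["billing"], "cost_usd": 0.09, "prompt_version": "v2",
     "input": "Refund me",
     "steps": [{"name": "refund", "error": "timeout"}]},
]

support = select_slice(traces, tag="support")
cases = [to_regression_case(t) for t in support if first_failing_step(t)]
```

Keeping `source_trace_id` on each case preserves the link back to the original run, so a regression that reappears later can be compared against the trace that motivated it.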
Cost-efficiency matters
The cheapest model is not always the cheapest system. An agent using a cheap model can still be expensive if it retries too often, loops through tools, grows context every step, or calls a high-latency service that forces another model round trip.
For agent evaluation, cost should be measured at the trace level. Ask:
- How much did the entire run cost, not just the final generation?
- How many model calls did successful runs require?
- How often did failed runs spend almost as much as successful runs?
- Which tool or retrieval step caused repeated retries?
- Which prompt version increased tokens without improving quality?
- What is the cost per resolved task, not the cost per model call?
Currai rolls token usage and cost up from generations into the trace, so expensive agent runs are visible as runs. That is the unit the user experiences, and it is the unit the business pays for.
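A small worked example shows why the trace-level view changes the number. The run records and function names below are made up for illustration, and the sketch assumes that failed runs count toward the numerator because the business pays for them too.

```python
def run_cost(run):
    """Trace-level cost: the sum of every generation in the run."""
    return sum(run["generation_costs_usd"])


def cost_per_resolved_task(runs):
    """Total spend across all runs divided by the number of resolved runs."""
    total = sum(run_cost(r) for r in runs)
    resolved = sum(1 for r in runs if r["resolved"])
    return total / resolved if resolved else None


def mean_cost(runs, *, resolved):
    """Average run cost for either the resolved or the failed runs."""
    costs = [run_cost(r) for r in runs if r["resolved"] == resolved]
    return sum(costs) / len(costs) if costs else None


runs = [
    {"resolved": True, "generation_costs_usd": [0.01, 0.02]},
    {"resolved": True, "generation_costs_usd": [0.01]},
    # A failed run that looped three times before giving up.
    {"resolved": False, "generation_costs_usd": [0.02, 0.02, 0.02]},
]

print(round(cost_per_resolved_task(runs), 4))
print(round(mean_cost(runs, resolved=True), 4), round(mean_cost(runs, resolved=False), 4))
```

In this toy data the average successful run costs about $0.02, but the cost per resolved task is about $0.05 because the failed loop is paid for without producing an outcome.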
Multi-turn conversations need session-level review
Some agent failures only appear across a session. A single turn may look fine, but the conversation can still fail because the agent forgot a constraint, repeated a question, contradicted an earlier answer, or followed stale state from three turns ago.
That is why session IDs matter. One trace can explain a turn. A session can explain the conversation.
In Currai, every trace sharing a sessionId can be grouped into the same conversation timeline. That lets you evaluate whether the agent preserved intent, used prior context correctly, escalated at the right time, and kept cost under control across the full interaction.
For multi-turn agent evaluation, score both levels:
- Per-turn checks for answer quality, tool use, retrieval faithfulness, latency, and cost.
- Per-session checks for goal completion, consistency, unnecessary repeated work, user frustration signals, and total cost.
This avoids a common blind spot: optimizing individual answers while the full conversation still feels broken.
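Session-level checks can be sketched by grouping exported traces on their session ID and summarizing each group. Everything here is an assumption for illustration: the dictionary fields, the `turn` ordering key, and the crude repeated-question heuristic are not part of any Currai API.

```python
from collections import defaultdict


def group_by_session(traces):
    """Group traces by sessionId, ordered by turn within each session."""
    sessions = defaultdict(list)
    for trace in sorted(traces, key=lambda t: t["turn"]):
        sessions[trace["sessionId"]].append(trace)
    return dict(sessions)


def session_summary(turns):
    """Per-session signals that no single turn can show."""
    # Crude heuristic: an agent output ending in "?" counts as a question.
    questions = [t["output"] for t in turns if t["output"].strip().endswith("?")]
    return {
        "turns": len(turns),
        "total_cost_usd": sum(t["cost_usd"] for t in turns),
        "goal_completed": turns[-1]["resolved"],
        "repeated_questions": len(questions) - len(set(questions)),
    }


traces = [
    {"sessionId": "s1", "turn": 2, "output": "What is your order number?",
     "cost_usd": 0.01, "resolved": False},
    {"sessionId": "s1", "turn": 1, "output": "What is your order number?",
     "cost_usd": 0.01, "resolved": False},
    {"sessionId": "s1", "turn": 3, "output": "Your order ships Friday.",
     "cost_usd": 0.01, "resolved": True},
    {"sessionId": "s2", "turn": 1, "output": "Done.",
     "cost_usd": 0.02, "resolved": True},
]

sessions = group_by_session(traces)
summary = session_summary(sessions["s1"])
```

In this example each turn in session `s1` might pass a per-turn check, yet the summary shows the agent asked the same question twice before resolving the request.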
A Currai workflow for stateful agent evals
Currai gives stateful agent evals the same starting point as debugging: a production trace with the whole request tree intact.
Start by tracing the agent run:
- Create one trace per agent run or product turn.
- Pass userId and sessionId so runs can be grouped by user and conversation.
- Record model decisions as generations with model, input, output, parameters, usage, and prompt version.
- Record tool calls, retrieval, MCP actions, and other work as spans.
- Attach tags and metadata for feature name, environment, tenant, experiment arm, provider, and routing policy.
Then evaluate the traces:
- Filter to the prompt, agent, model, session, or feature you want to inspect.
- Review high-cost, high-latency, error, and step-heavy traces first.
- Run evals on real traced outputs with focused rubrics.
- Compare scores across prompt versions, models, or experiment variants.
- Open failed examples and inspect the exact generation, tool span, retrieval span, or session turn that caused the issue.
That closes the loop between AI observability and LLM evaluation. Traces explain what happened. Evals tell you whether it was good. Together, they give you the evidence to change the prompt, tool, routing policy, retrieval strategy, or model choice with less guessing.
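The comparison step can be as simple as averaging eval scores per prompt version once each scored trace carries its version. The result records and the `scores_by_version` helper are hypothetical; a real export would likely hold more fields.

```python
from collections import defaultdict
from statistics import mean


def scores_by_version(results):
    """Average eval score and sample count for each prompt version."""
    buckets = defaultdict(list)
    for result in results:
        buckets[result["prompt_version"]].append(result["score"])
    # Report n alongside the mean so small samples are easy to spot.
    return {version: {"mean": mean(scores), "n": len(scores)}
            for version, scores in buckets.items()}


results = [
    {"trace_id": "t1", "prompt_version": "v1", "score": 1.0},
    {"trace_id": "t2", "prompt_version": "v1", "score": 0.0},
    {"trace_id": "t3", "prompt_version": "v2", "score": 1.0},
    {"trace_id": "t4", "prompt_version": "v2", "score": 1.0},
]

summary = scores_by_version(results)
```

With only a couple of traces per version, a difference like this is a prompt to open the failed example, not evidence that one version is better.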
FAQs
What are stateful agent evals?
Stateful agent evals measure AI agents that carry context across steps, call tools, use retrieval, and make decisions over time. They evaluate the whole run: final answer, model decisions, tool calls, retrieval behavior, latency, token usage, cost, and session-level behavior.
How do you evaluate tool-using agents?
Evaluate the final answer and the tool path. Check whether the agent chose the right tool, passed valid arguments, handled tool errors correctly, used tool outputs faithfully, avoided repeated actions, and stopped or escalated when the tool result was insufficient.
Should agent evals use production traces or synthetic tests?
Use both. Synthetic tests are good for known edge cases and release gates. Production traces are better for discovering real user behavior, long-tail inputs, expensive loops, tool failures, prompt regressions, and session-level problems. The best workflow turns failed production traces into future regression tests.
How do you measure agent cost-efficiency?
Measure cost at the trace or session level, not only per model call. Track total tokens, model cost, tool retries, repeated generations, step count, latency, and whether the run actually resolved the task. The useful metric is cost per successful outcome.
