Jul 2, 2026

Demystifying AI agent evals with traces

AI agent evals need more than a final answer score. Learn how to evaluate tool calls, transcripts, outcomes, regressions, non-determinism, and production traces.

GUIDE | 9 min read | The Currai team / Engineering

AI agent evals are harder than prompt evals because agents do not just answer. They plan, call tools, inspect intermediate results, update state, and decide what to do next. A single user request can become a full trajectory of model calls, tool calls, retrieval steps, browser actions, retries, and final output.

That is why evaluating an AI agent on its final message alone is usually too thin. The answer might be correct for the wrong reason. It might sound polished while ignoring a tool result. It might complete the task but burn ten times the expected tokens. Or it might fail because the environment, grader, or task definition was ambiguous.

Good agent evaluations measure the whole run: the task, the transcript, the tools, the outcome, the graders, and the production behavior you actually care about.

What an agent evaluation is measuring

An evaluation, or eval, is a structured test for an AI system. You give the agent a task, run it in an environment, capture what happened, and grade the result against success criteria.

For a simple LLM call, that can be as small as:

  • Input prompt
  • Model output
  • Grader score

For an AI agent, the evaluation unit is larger:

  • Task: the problem the agent must solve, including inputs and success criteria.
  • Trial: one attempt at the task. Because model behavior is non-deterministic, the same task may need several trials.
  • Transcript: the full sequence of messages, model outputs, tool calls, intermediate results, errors, and decisions.
  • Outcome: the state that exists after the run, such as a ticket resolved, a reservation created, a file changed, or a pull request passing tests.
  • Grader: code, model, or human judgment that scores some part of the run.
  • Eval harness: the infrastructure that starts tasks, runs agents, records transcripts, executes graders, and aggregates results.
  • Agent harness: the scaffold that lets the model act: loop control, tool routing, state management, memory, retries, and final response handling.
  • Evaluation suite: a set of tasks that measures a capability, product flow, or regression surface.

This vocabulary matters because "the agent failed" is not specific enough to fix anything. Did the task fail because the model chose the wrong tool? Because the tool returned stale data? Because the final answer ignored the tool output? Because the grader was too rigid? Because the environment leaked state from a previous trial?

Agent evals are useful when they separate those possibilities.
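One way to keep this vocabulary concrete is to mirror it in the harness's data model. The sketch below is a minimal Python illustration. The class and field names (Task, Trial, GraderResult, success_criteria, and so on) are assumptions made for this example, not the schema of any particular tool.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Task:
    task_id: str
    input: str
    success_criteria: str


@dataclass
class GraderResult:
    grader: str        # hypothetical grader name, e.g. "refund_state_check"
    passed: bool
    detail: str = ""   # short explanation, useful when reading failures


@dataclass
class Trial:
    task: Task
    # Messages, tool calls, intermediate results, and errors, in order.
    transcript: list[dict[str, Any]] = field(default_factory=list)
    # Environment state after the run (ticket status, file contents, ...).
    outcome: dict[str, Any] = field(default_factory=dict)
    grades: list[GraderResult] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # A trial passes only if it was graded and every grader passed.
        return bool(self.grades) and all(g.passed for g in self.grades)

    def failed_graders(self) -> list[str]:
        # Names of the graders that rejected this trial.
        return [g.grader for g in self.grades if not g.passed]
```

Keeping the transcript, the outcome, and the individual grader results on the same record is what lets you ask which part of the run failed instead of only whether it failed.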

Why teams build evals for agents

Manual testing works for early prototypes. You run the agent on a few examples, read the answers, tweak the prompt, and try again. That is enough while the surface area is small and the team can remember every important case.

It breaks down once the agent is in production.

Without evals, every change becomes a guess:

  • Did the new model improve quality or just change tone?
  • Did the prompt fix one failure while creating another?
  • Did tool-use accuracy improve on support tickets but regress on refunds?
  • Did cost per resolved task go up?
  • Did the agent become less reliable across repeated trials?

Evals turn those questions into repeatable checks. They give you baselines, regression tests, and a way to compare prompt versions, model upgrades, routing rules, retrieval changes, and tool changes without waiting for users to report that something feels worse.

The compounding value is the important part. Each production failure can become a test case. Each test case protects a behavior. Each protected behavior makes future changes less dependent on intuition.

The three main grader types

Most useful agent evaluation suites combine deterministic graders, model-based graders, and human review. Each one is good at a different job.

Deterministic graders

Deterministic graders use code to verify objective conditions. They are fast, cheap, reproducible, and easy to debug.

Common examples include:

  • Unit tests for generated code.
  • Exact, regex, or fuzzy string checks.
  • Static analysis, linting, type checks, and security scans.
  • Database or API state checks after a tool action.
  • Tool-call checks for required tools or prohibited actions.
  • Transcript metrics such as number of turns, token usage, latency, and error count.

Use deterministic graders whenever the desired outcome can be checked directly. If a coding agent claims it fixed a bug, run the tests. If a support agent claims it processed a refund, check the refund object. If a browser agent claims it submitted a form, verify the backend state.

The weakness is brittleness. A deterministic grader can reject a valid solution if it expects one exact path, one exact string, or one exact sequence of tool calls. For agents, grade the outcome whenever possible, not only the path.
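As a rough illustration of grading the outcome rather than the path, the sketch below checks a refund record and scans a transcript for prohibited tools. The record fields (status, amount_cents), the transcript step shape, and the function names are all hypothetical. A real grader would read from your own backend.

```python
def grade_refund_outcome(refunds: dict, order_id: str, expected_amount_cents: int):
    """Check the stored refund record, not the agent's claim about it.

    `refunds` stands in for a backend lookup keyed by order ID (an assumption).
    Returns (passed, detail).
    """
    record = refunds.get(order_id)
    if record is None:
        return False, "no refund record found"
    if record.get("status") != "processed":
        return False, f"refund status is {record.get('status')!r}"
    if record.get("amount_cents") != expected_amount_cents:
        return False, f"refund amount is {record.get('amount_cents')!r}"
    return True, "ok"


def prohibited_tool_calls(transcript: list, prohibited: set) -> list:
    """Return the names of prohibited tools the agent called, in call order."""
    return [
        step["tool"]
        for step in transcript
        if step.get("type") == "tool_call" and step.get("tool") in prohibited
    ]
```

Neither check cares which sequence of steps the agent took, so a valid but unexpected path still passes.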

Model-based graders

Model-based graders, often called LLM-as-judge graders, score outputs or transcripts with a rubric. They are useful when correctness is contextual, freeform, or partly subjective.

Good uses include:

  • Instruction following.
  • Helpfulness and completeness.
  • Faithfulness to retrieved sources or tool outputs.
  • Conversation quality and tone.
  • Safety policy adherence.
  • Pairwise comparison between two outputs.
  • Rubric-based scoring across several quality dimensions.

Model graders are flexible, but they need calibration. A vague rubric produces noisy scores. A judge that is not allowed to say "not enough information" may invent reasons. A single all-purpose judge prompt can blur distinct dimensions that should be scored separately.

Treat model graders like production code: test them, read their rationales, compare them with human judgment, and revise the rubric when they reward the wrong behavior.
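A minimal sketch of those two habits, assuming a judge that is allowed to answer "insufficient_information" and a small set of human labels to compare against. The rubric wording, the verdict names, and the call_model parameter (any function that takes a prompt string and returns the model's text) are assumptions for illustration. No specific model API is implied.

```python
from typing import Callable

ALLOWED_VERDICTS = {"pass", "fail", "insufficient_information"}

RUBRIC_PROMPT = """You are grading one dimension only: faithfulness to the tool output.
Answer on the first line with exactly one of: pass, fail, insufficient_information.
Use insufficient_information if the tool output does not let you decide.

Tool output:
{tool_output}

Agent answer:
{answer}
"""


def judge_faithfulness(tool_output: str, answer: str, call_model: Callable[[str], str]) -> str:
    """Return a verdict from ALLOWED_VERDICTS, or "invalid" if the reply cannot be parsed."""
    prompt = RUBRIC_PROMPT.format(tool_output=tool_output, answer=answer)
    reply = call_model(prompt).strip()
    first_line = reply.splitlines()[0].strip().lower() if reply else ""
    return first_line if first_line in ALLOWED_VERDICTS else "invalid"


def agreement_rate(judge_labels: list, human_labels: list) -> float:
    """Fraction of examples where the judge and the human reviewer agree."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

Tracking "invalid" separately from "fail" keeps parsing problems in the judge from being counted as agent failures.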

Human graders

Human graders are slow and expensive, but they are still the reference point for many agent workflows. Domain experts, support leads, product managers, sales teams, clinicians, lawyers, analysts, or trained reviewers often know what good looks like before a rubric does.

Use human review to:

  • Calibrate LLM judges.
  • Audit failures that automated graders cannot explain.
  • Review subjective outputs where expert judgment matters.
  • Spot-check production traces and long conversations.
  • Define the examples that later become automated regression tests.

Human review should not be the only evaluation layer, but removing it entirely usually leads to false confidence.

Capability evals vs regression evals

Agent evaluation suites usually serve one of two purposes.

Capability evals ask what the agent can do. They should include tasks that are currently hard. A good capability suite may start with a low pass rate because it gives the team a hill to climb.

Regression evals ask whether the agent still does what it used to do. They should pass almost all the time. If a regression suite drops, something broke.

Both are necessary. Capability evals help you improve. Regression evals stop you from trading away existing behavior while chasing a new score.

Over time, a capability eval can graduate into a regression suite. Once the agent reliably handles a class of tasks, those tasks should keep running so future prompt, model, and tool changes do not quietly undo the gain.
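One possible way to decide when a task is ready to graduate is a pass-rate threshold over recent runs. The sketch below assumes a history of pass/fail results per task. The function name and the defaults (20 runs, 95% pass rate) are illustrative choices, not recommended values.

```python
def tasks_ready_to_graduate(history: dict, min_runs: int = 20, min_pass_rate: float = 0.95) -> list:
    """Return task IDs whose most recent `min_runs` results meet the pass-rate bar.

    `history` maps task ID -> list of booleans, oldest result first.
    """
    ready = []
    for task_id, results in history.items():
        recent = results[-min_runs:]
        if len(recent) < min_runs:
            continue  # not enough evidence yet
        if sum(recent) / len(recent) >= min_pass_rate:
            ready.append(task_id)
    return sorted(ready)
```

Requiring a minimum number of runs keeps a task from graduating on the strength of a few lucky trials.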

Evaluating coding agents

Coding agents are often the easiest agents to grade objectively because software has executable checks. The agent receives a task, modifies files, runs commands, and produces a patch. The grader can run tests, lint, type checks, security checks, and repository-specific validation.

A practical coding-agent eval might check:

  • Did the failing tests pass?
  • Did existing tests keep passing?
  • Did static analysis still pass?
  • Did the patch stay within the requested scope?
  • Did the agent avoid unsafe shell commands?
  • Did it update the correct files?
  • Did the final answer accurately describe the change?

For coding agents, correctness should usually come from the repo. LLM rubrics are useful for code quality, maintainability, over-engineering, or instruction following, but they should not replace tests when tests can express the outcome.

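task:
  id: fix-empty-password-auth-bypass
  input: Fix the login bug where an empty password can bypass auth.
  graders:
    # Outcome checks that the repository can verify directly.
    - type: deterministic_tests
      command: pnpm test auth
    - type: static_analysis
      command: pnpm typecheck
    # Judgment-based check for qualities that tests cannot express.
    - type: llm_rubric
      rubric: security_patch_quality
  # Transcript metrics recorded alongside the pass/fail result.
  metrics:
    - n_turns
    - n_tool_calls
    - total_tokens
    - total_latency_ms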

The point is not that every coding eval needs this many checks. The point is that the final score should reflect both the working outcome and the behavior you want the agent to preserve.
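The command-based graders in a task like this could be executed with a small wrapper around the standard library, treating a zero exit code as a pass. This is a sketch under that assumption. The function names and the result fields are hypothetical, and a real harness would also handle sandboxing and log storage.

```python
import subprocess
from typing import Optional


def run_command_grader(command: list, cwd: Optional[str] = None, timeout_s: int = 600) -> dict:
    """Run one check command and report pass/fail based on its exit code."""
    try:
        proc = subprocess.run(
            command, cwd=cwd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return {"command": command, "passed": False, "detail": "timed out"}
    except OSError as exc:
        # Raised, for example, when the executable does not exist.
        return {"command": command, "passed": False, "detail": str(exc)}
    # Keep only the tail of the output so results stay small.
    detail = (proc.stdout + proc.stderr)[-2000:]
    return {"command": command, "passed": proc.returncode == 0, "detail": detail}


def run_command_graders(commands: list, cwd: Optional[str] = None) -> dict:
    """Run every command and pass only if all of them pass."""
    results = [run_command_grader(cmd, cwd=cwd) for cmd in commands]
    return {"passed": all(r["passed"] for r in results), "results": results}
```

For the task above, the call might look like run_command_graders([["pnpm", "test", "auth"], ["pnpm", "typecheck"]], cwd=repo_dir), where repo_dir is the checkout the agent modified.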

Evaluating conversational agents

Conversational agents introduce a different challenge: the interaction is part of the product. A support agent can technically resolve a ticket while making the customer repeat themselves, ignoring frustration, or taking too many turns.

Useful conversational-agent eval dimensions include:

  • Did the agent complete the user's task?
  • Did it ask for required information before taking action?
  • Did it use the correct policy, account data, or tool result?
  • Was the tone appropriate for the situation?
  • Did it finish within a reasonable number of turns?
  • Did it escalate when automation was not appropriate?
  • Did the backend state match the claimed resolution?

Many conversational evals need a simulated user. The simulated user can play a persona, withhold information, express confusion, or push the agent toward an edge case. The agent's final answer is then only one artifact in a longer conversation.

For production systems, session-level evaluation is also important. One turn may look fine while the full session shows drift, repeated questions, lost context, or unresolved intent.
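As a small example of a session-level check, the sketch below flags questions the agent asked more than once in the same session. It assumes each turn is a dict with role and text keys, and it only catches verbatim repeats after normalizing case and whitespace. A paraphrased repeat would need a model-based grader.

```python
from collections import Counter


def repeated_agent_questions(session: list) -> list:
    """Return agent questions that appear more than once in a session.

    Each turn is assumed to look like {"role": "agent" | "user", "text": "..."}.
    """
    questions = [
        " ".join(turn["text"].lower().split())  # normalize case and whitespace
        for turn in session
        if turn["role"] == "agent" and turn["text"].strip().endswith("?")
    ]
    counts = Counter(questions)
    return [question for question, n in counts.items() if n > 1]
```

A turn-level grader would score each of those questions as reasonable on its own. Only the session view shows that the customer had to answer twice.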

Evaluating research agents

Research agents gather information, inspect sources, synthesize findings, and produce answers or reports. Their evals are hard because there may be no single canonical answer, and the ground truth can change.

Good research-agent evals focus on dimensions such as:

  • Source quality.
  • Coverage of required facts.
  • Groundedness of claims.
  • Citation accuracy.
  • Completeness relative to the question.
  • Appropriate uncertainty.
  • Clear separation between evidence and inference.

Use exact-match checks for facts that really are objective. Use model-based rubrics for synthesis quality, source use, and unsupported claims. Use expert human review when the task requires domain judgment.

For research agents, the transcript matters. You need to know which sources were retrieved, which ones were ignored, and where unsupported claims entered the answer.
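A simple deterministic check along these lines compares the sources an answer cites with the sources the transcript shows were retrieved. The sketch assumes citations appear as bracketed IDs such as [S1]. That citation format and the function name are assumptions for illustration.

```python
import re


def citation_report(answer: str, retrieved_ids: set) -> dict:
    """Compare bracketed citation IDs in the answer with retrieved source IDs."""
    cited = set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))
    return {
        # Cited but never retrieved: a likely fabricated or mistaken citation.
        "unknown_citations": sorted(cited - retrieved_ids),
        # Retrieved but never cited: sources the agent ignored.
        "unused_sources": sorted(retrieved_ids - cited),
    }
```

This only verifies that citations point at real retrieved sources. Whether a cited source actually supports the claim is still a job for a rubric or a human reviewer.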

Evaluating browser and computer-use agents

Browser agents and computer-use agents operate through interfaces built for humans. They click, scroll, type, inspect pages, read screenshots, use desktop apps, and modify state in systems that may not expose clean APIs.

The right grader usually checks the environment after the task:

  • Did the page end in the expected state?
  • Was the correct record created or updated?
  • Did the file system contain the expected artifact?
  • Did a form submit with valid values?
  • Did the agent avoid prohibited actions?
  • Did it use the right interaction mode for the job?

For these agents, screenshots and UI state are useful, but they are not enough. If the task is to place an order, the confirmation page is weaker evidence than the backend order record. If the task is to update a CRM field, inspect the saved record, not only the text on the screen.

Non-determinism is part of the measurement

Agent behavior varies between runs. The same task can pass once and fail the next time because of sampling, tool timing, retrieval variation, flaky environments, or a slightly different reasoning path.

That means one trial rarely tells the full story.

Two useful ways to think about repeated trials are:

  • pass@k: the chance the agent succeeds at least once across k attempts.
  • pass^k: the chance the agent succeeds every time across k attempts.

pass@k is useful when one good solution is enough, such as generating several candidate patches and choosing one. pass^k is useful when the user expects reliable behavior on every attempt, such as a customer-facing support workflow.

For production agents, consistency often matters more than peak capability. An agent that sometimes solves a refund perfectly and sometimes ignores policy is not ready just because one trial looked good.
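To see how far apart the two metrics can be, assume each trial succeeds independently with the same probability p. Then pass@k is 1 - (1 - p)^k and pass^k is p^k. With p = 0.7 and k = 3, pass@k is about 0.973 while pass^k is about 0.343. The sketch below computes both, plus a commonly used combinatorial estimate of pass@k from n recorded trials with c successes. The function names are illustrative, and real trials are not always independent.

```python
from math import comb


def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


def pass_at_k_from_trials(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n observed trials with c successes.

    Uses 1 - C(n - c, k) / C(n, k): the chance that a random subset of
    k of the n trials contains at least one success. Requires k <= n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The same agent can look nearly perfect on pass@k and unreliable on pass^k, which is why the choice of metric should follow from how the product is used.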

A practical roadmap for building agent evals

You do not need hundreds of tasks to start. A useful first suite can come from the checks you already perform manually and the failures users have already reported.

Start with a small but real set:

  1. Pick one agent workflow that matters.
  2. Convert manual QA cases into task definitions.
  3. Add recent production failures as regression cases.
  4. Write success criteria that two reviewers would grade the same way.
  5. Create at least one reference solution or expected outcome.
  6. Include positive and negative cases so the agent does not overfit one behavior.
  7. Run each trial in a clean environment with isolated state.
  8. Prefer deterministic outcome checks where possible.
  9. Add LLM rubrics only for dimensions that need judgment.
  10. Read failed transcripts before trusting the aggregate score.

The last step is not optional. Transcript review is how you discover whether the agent genuinely failed, the grader was unfair, the task was ambiguous, or the environment introduced noise.
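For step 7, one lightweight way to isolate file-system state is to give every trial its own temporary directory. The sketch below assumes the agent is exposed as a function that receives a working directory. The run_agent parameter and the seed_files format are hypothetical, and databases or external services would need their own reset step.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_trial_isolated(run_agent: Callable[[Path], None], seed_files: dict) -> dict:
    """Run one trial in a fresh directory and return the files it leaves behind.

    `seed_files` maps relative paths to text content for the starting state.
    The directory is deleted afterwards, so nothing leaks into the next trial.
    """
    with tempfile.TemporaryDirectory() as workdir:
        root = Path(workdir)
        for relative_path, content in seed_files.items():
            path = root / relative_path
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(content)
        run_agent(root)
        # Snapshot the final state before the directory is removed.
        return {
            p.relative_to(root).as_posix(): p.read_text()
            for p in sorted(root.rglob("*"))
            if p.is_file()
        }
```

The returned snapshot can then be handed to outcome graders, so grading never depends on a directory that an earlier trial may have touched.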

Why traces make agent evals actionable

An eval score tells you whether the run passed. A trace tells you why.

For AI agents, a trace should preserve the full tree of work:

  • The initial user input.
  • Session ID, user ID, environment, tags, and metadata.
  • Every model generation with input, output, model, parameters, prompt version, usage, latency, and cost.
  • Tool spans with names, arguments, outputs, errors, and timing.
  • Retrieval spans with query, selected documents, and assembled context.
  • Final output and final environment state.

This turns evaluation from a scoreboard into a debugging workflow.

If an agent failed a groundedness rubric, the trace can show whether retrieval missed the right document or the model ignored a good retrieval result. If a customer-support agent failed a refund task, the trace can show whether identity verification was skipped, the policy tool returned the wrong result, or the refund API failed. If a browser agent exceeded the latency budget, the trace can show whether the time went to screenshots, DOM extraction, model calls, or page loads.
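The questions above become short queries once the trace is stored as a tree. The sketch below uses a hypothetical Span type, not the format of any specific tracing tool, and assumes latency_ms records the time spent in a span itself, excluding its children.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    name: str
    kind: str                    # e.g. "generation", "tool", "retrieval"
    latency_ms: float            # assumed to exclude time spent in children
    error: Optional[str] = None
    children: list = field(default_factory=list)


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def latency_by_kind(root: Span) -> dict:
    """Total latency per span kind across the whole trace."""
    totals = defaultdict(float)
    for span in walk(root):
        totals[span.kind] += span.latency_ms
    return dict(totals)


def first_error(root: Span) -> Optional[Span]:
    """First span in depth-first order that recorded an error, if any."""
    return next((span for span in walk(root) if span.error), None)
```

Grouping latency by kind answers where the time went. The first errored span is often the quickest entry point into a failed run.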

Without traces, teams often stare at the final output and guess. With traces, they inspect the run that produced it.

How evals fit with production monitoring

Automated evals are not a replacement for production monitoring, user feedback, A/B testing, or manual review. They are one layer.

Use automated evals before shipping changes. Use production monitoring after shipping to catch distribution drift, tool failures, cost spikes, and cases your test suite did not imagine. Use A/B tests when you need real user outcome data. Use human review to calibrate model graders and understand subjective failures.

The best agent teams connect these loops:

  • Production traces reveal failures.
  • Failures become eval tasks.
  • Eval tasks protect the behavior.
  • Traces from new runs explain remaining failures.
  • Monitoring catches drift when the world changes.

That is eval-driven agent development in practice.
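The step from failure to eval task can be as small as copying the original input out of a trace and attaching the outcome a reviewer says should have happened. The sketch below assumes a trace is a dict with trace_id, input, and optional tags. Those field names and the task shape are hypothetical.

```python
def trace_to_regression_task(trace: dict, expected_outcome: dict) -> dict:
    """Build a regression task from a failed production trace.

    `expected_outcome` comes from a reviewer, not from the failed run itself.
    """
    return {
        "id": f"regression-{trace['trace_id']}",
        "input": trace["input"],
        "expected_outcome": expected_outcome,
        "tags": sorted(set(trace.get("tags", [])) | {"from-production"}),
        # Keep a pointer back to the evidence that motivated the task.
        "source_trace_id": trace["trace_id"],
    }
```

Keeping the source trace ID on the task makes it easy to reopen the original run when the regression case fails again later.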

Currai closes the loop between traces and evals

Currai is built for teams that need to understand real AI agent behavior, not just score isolated strings.

With Currai, you can trace every agent run as a tree of generations and spans, including model calls, tool calls, retrieval steps, MCP actions, prompt versions, sessions, users, latency, token usage, and cost. Then you can run evals on the production traces that matter: failed runs, expensive runs, high-latency runs, specific prompt versions, model experiments, customer segments, or full conversation sessions.

That gives you one workflow:

  • Trace the agent.
  • Score the outcome.
  • Open failed transcripts.
  • See the exact generation, tool call, or span that caused the issue.
  • Turn that failure into a regression case.
  • Compare the next prompt, model, or routing change against real evidence.

Agent evals are only useful if they help the team make better decisions. Currai keeps the evidence in one place so you can move from "the agent feels worse" to "this prompt version reduced tool faithfulness on refund traces, increased latency by 18%, and failed on these exact sessions."

If you are building an AI agent, start by tracing the run. The eval is stronger when the transcript, outcome, cost, and production context are already there.
