Observability for AI agents and tool calls

The Currai team, Engineering · Apr 5, 2026

A chatbot is one model call. An agent is a loop: the model decides on a tool, the tool runs, the result goes back to the model, and around it goes until the task is done or the budget runs out. One user request can be twenty model calls and a dozen tool invocations — and when it fails, "the agent got stuck" is all you have unless you traced the loop.

Nest the loop under one trace

The whole agent run is a single trace. Each iteration adds a generation for the model's decision and a span for the tool it called, so the trace becomes a readable transcript of the agent's reasoning.

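# One trace for the whole agent run; every step below nests under it.
# call_model, parse, run_tool and update are placeholders for your own code.
trace = currai.trace(name="agent-run", user_id="user-1")

for step in range(max_steps):
    # Record the model's decision for this step as a generation.
    gen = trace.generation(name=f"decide-{step}", model="gpt-4o", input=state)
    gen_output = call_model(state)  # placeholder: your model client call
    action = parse(gen_output)
    gen.end(output=action)

    if action.type == "final":
        break

    # Record the tool call as a span so its input and output sit next to the decision.
    tool = trace.span(name=action.tool, input=action.args)
    result = run_tool(action.tool, action.args)
    tool.end(output=result)
    state = update(state, result)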

What the trace reveals about a stuck agent

Agents fail in characteristic ways, and each leaves a fingerprint in the trace:

  • Loops — the same tool called with the same arguments, step after step. The trace shows the repetition immediately.
  • Wrong tool — the model picks a tool that can't answer the question, gets a useless result, and flails. You see the bad choice at the exact step it happened.
  • Context blowup — the state grows every iteration until the prompt is mostly history. Token counts climbing per generation tell the story.
  • Bad tool output — the tool returned garbage and the model dutifully built on it. The span output shows the garbage at its source.
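
Two of these fingerprints, loops and context blowup, can be checked mechanically once the steps of a run are exported from a trace. The sketch below assumes a hypothetical `Step` record holding the tool name, its arguments, and the prompt token count for that iteration; the record shape and function names are illustrative and not part of any Currai API.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class Step:
    """One loop iteration as recovered from a trace (hypothetical shape)."""
    tool: str
    args: dict[str, Any]
    prompt_tokens: int


def longest_repeat(steps: list[Step]) -> int:
    """Length of the longest run of consecutive identical tool calls."""
    best = run = 0
    prev = None
    for s in steps:
        # Sort keys so argument order does not hide an identical call.
        sig = (s.tool, json.dumps(s.args, sort_keys=True))
        run = run + 1 if sig == prev else 1
        best = max(best, run)
        prev = sig
    return best


def context_growth(steps: list[Step]) -> float:
    """Ratio of the last step's prompt tokens to the first step's."""
    if not steps or steps[0].prompt_tokens == 0:
        return 0.0
    return steps[-1].prompt_tokens / steps[0].prompt_tokens


steps = [
    Step("search", {"q": "x"}, 1000),
    Step("search", {"q": "x"}, 2000),
    Step("search", {"q": "x"}, 4000),
    Step("fetch", {"url": "u"}, 8000),
]
print(longest_repeat(steps), context_growth(steps))  # 3 8.0
```

A run with a long repeat or a large growth ratio is a reasonable candidate for a closer look; the thresholds that matter depend on the agent.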

Cost is per-run, not per-call

The dangerous thing about agents is that cost compounds. Each iteration replays the growing state, so a ten-step run can cost far more than ten times a single call. Because every generation in the loop rolls up to the trace, you get the true per-run cost — and the runs that quietly cost ten dollars stop hiding behind a cheap average.
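
A quick back-of-the-envelope calculation shows how the compounding works. This sketch assumes a simplified model: every step replays a fixed base prompt plus everything added by earlier steps, each step adds the same number of tokens, and output tokens and prompt caching are ignored. The token counts are made up for illustration.

```python
def run_input_tokens(base_tokens: int, added_per_step: int, steps: int) -> int:
    """Total input tokens when step k replays base + k * added_per_step tokens."""
    return sum(base_tokens + k * added_per_step for k in range(steps))


single_call = run_input_tokens(1_000, 2_000, 1)   # one call, no history yet
naive_ten = 10 * single_call                      # "ten times a single call"
actual_ten = run_input_tokens(1_000, 2_000, 10)   # ten steps with growing state

print(single_call, naive_ten, actual_ten)  # 1000 10000 100000
```

Under these assumptions the ten-step run consumes ten times the naive estimate, because input tokens grow roughly with the square of the step count.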

Evaluate the outcome, not just the steps

A perfectly traced agent can still give a wrong answer. Score the final output the way you'd score any generation, and when it's bad, you have the full reasoning trace to explain why. Step-level visibility plus an outcome score is what turns an agent from a black box you hope works into a system you can actually debug.
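
As a minimal sketch of pairing an outcome score with the trace, assume each run is exported as a hypothetical `RunResult` with a trace id, the agent's final answer, and a reference answer. The scorer here is a simple normalized exact match, chosen only for illustration; a real evaluation would likely use something richer.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    trace_id: str      # identifier used to look the run up in your tracing tool
    final_output: str  # the agent's final answer
    expected: str      # reference answer from an evaluation set


def normalized_match(output: str, expected: str) -> float:
    """Return 1.0 when the answers match ignoring case and extra whitespace."""
    def norm(text: str) -> str:
        return " ".join(text.lower().split())
    return 1.0 if norm(output) == norm(expected) else 0.0


def traces_to_inspect(results: list[RunResult]) -> list[str]:
    """Trace ids of runs whose final answer scored 0."""
    return [
        r.trace_id
        for r in results
        if normalized_match(r.final_output, r.expected) == 0.0
    ]


results = [
    RunResult("t1", " Paris ", "paris"),
    RunResult("t2", "Lyon", "Paris"),
]
print(traces_to_inspect(results))  # ['t2']
```

The returned ids are the runs whose step-by-step traces are worth opening first.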