Jun 29, 2026

PMs should own the AI eval loop

AI evals are not just engineering tests. Product managers need real traces, domain judgment, and a repeatable loop for turning model failures into product improvements.

Guide · 7 min read · The Currai team / Product

The best AI product teams are not the ones with the longest prompts, the most elaborate agent graphs, or the strongest opinions about which model is winning this week. They are the teams that can look at what their product actually did, understand where it failed, and turn that into a repeatable improvement loop.

That is what evals are really for.

Not perfect grades. Not academic benchmarks. Not a dashboard full of generic scores nobody trusts. Practical evals are a way to make an AI product better using evidence from real user interactions.

For many products, the person who needs to drive that loop is the product manager.

Evals are product work

It is tempting to define evals as unit tests for AI. That is partly true, but the definition is too narrow.

A unit test can tell you whether the model returned valid JSON, included a required citation, or called a tool with the right schema. Those checks are useful. They are also only one slice of product quality.

Most AI failures are messier:

  • The assistant technically answered the question but ended the conversation too early.
  • The agent used the right tool but missed the user's actual intent.
  • The model gave a plausible response using stale or incomplete context.
  • The experience felt robotic in a moment where the product needed tact.
  • The assistant should have escalated to a human but tried to keep going.

Those are not just model problems. They are product problems.

An engineer can inspect whether a tool call succeeded. A judge model can help classify a narrow failure mode. But someone with product and domain context has to decide whether the interaction was good enough for the user and the business.

That is why PMs belong close to evals.

Start by looking at traces

The highest-value eval work usually starts before anyone writes an automated evaluator. It starts with reading real traces.

In Currai, a trace is the full record of one logical AI interaction. It can include the user input, the model generation, tool calls, retrieval steps, MCP activity, latency, token usage, errors, metadata, and final output.

That matters because product quality often depends on the sequence, not just the last message.
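To make the idea of an ordered record concrete, here is a minimal sketch of a trace as a data structure. The `Trace` and `Step` classes, their fields, and the helper functions are hypothetical names chosen for illustration; they are not Currai's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical shapes for illustration only; not Currai's real schema.
@dataclass
class Step:
    kind: str                    # e.g. "generation", "tool_call", "retrieval"
    name: str
    output: str = ""
    error: Optional[str] = None
    latency_ms: int = 0

@dataclass
class Trace:
    trace_id: str
    user_input: str
    final_output: str
    steps: list[Step] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)

def step_kinds(trace: Trace) -> list[str]:
    """Return the ordered sequence of step kinds a reviewer would read through."""
    return [step.kind for step in trace.steps]

def first_error(trace: Trace) -> Optional[Step]:
    """Return the earliest step that recorded an error, or None."""
    return next((step for step in trace.steps if step.error), None)

# A made-up trace with two steps and no recorded errors.
trace = Trace(
    trace_id="t-001",
    user_input="Can I move my appointment to Friday?",
    final_output="Your appointment is now on Friday at 10:00.",
    steps=[
        Step(kind="tool_call", name="lookup_calendar", output="Friday 10:00 free"),
        Step(kind="generation", name="draft_reply",
             output="Your appointment is now on Friday at 10:00."),
    ],
    metadata={"workflow": "scheduling"},
)

print(step_kinds(trace))
```

In this made-up trace the final message claims the appointment was moved, but the step sequence contains only a lookup and no rescheduling call. That kind of gap is visible only when the steps are read in order.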

If a leasing assistant tells a prospect that no matching apartment is available, the final answer might look factual. But the trace can show the missed product moment: the user was high intent, the assistant had enough context to know the request mattered, and the right next step was a human handoff.

Without traces, that becomes an anecdote.

With traces, it becomes a pattern you can measure.

The first eval is a notes exercise

The first pass should be deliberately simple.

Pull a sample of real traces. Read them one by one. For each trace, write down the first meaningful thing that went wrong. Do not build a taxonomy up front. Do not try to label every possible issue. Do not ask a model to replace your judgment at this stage.

Write plain notes:

  • Did not hand off when the user needed a human.
  • Claimed a capability the product does not support.
  • Asked a repeat question despite already having the answer.
  • Responded to a fragmented text message as if it were a complete request.
  • Gave a correct answer but failed to move the workflow forward.

This is where PM taste matters. A generic model can miss failures that require business context. It may read a response and decide it looks fine because it is coherent, polite, and not obviously false. The PM can see that the response failed the product.

That difference is the point.

Appoint one domain owner

Teams often make evals too expensive by turning early labeling into a committee exercise. That slows everything down before the team has learned anything.

For the first pass, appoint one domain owner whose judgment the team trusts. In many AI products, that person is the PM. In a legal product, it may be a lawyer. In a clinical workflow, it may be a clinician. In a support product, it may be the support lead who knows the escalation policy better than anyone else.

The goal is not universal agreement. The goal is to get enough signal to improve the product.

Once the first set of notes exists, you can group them into recurring failure modes. This is where AI can help. A model is useful for clustering notes, suggesting categories, and turning a messy list into a first draft of a failure taxonomy.

The human still owns the judgment. Rename the categories. Merge the ones that are too broad. Split the ones that hide different product problems. Add a "none of the above" bucket when your taxonomy misses something.

Then count.

Counting is not sophisticated, but it is powerful. If 22% of sampled traces have handoff issues, that is no longer a vague concern. It is a product quality problem with a clear owner.
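The counting step needs nothing more than a tally. The sketch below assumes each reviewed trace has been given at most one failure label (or `None` when nothing went wrong); the label names and sample data are invented for illustration.

```python
from collections import Counter
from typing import Optional

# Hypothetical review results: one label per sampled trace, None means no failure.
labels: list[Optional[str]] = [
    "handoff", None, "repeat_question", "handoff", None,
    None, "unsupported_claim", "handoff", None, None,
]

def failure_rates(labels: list[Optional[str]]) -> dict[str, float]:
    """Share of all sampled traces that fall into each failure mode."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(label for label in labels if label is not None)
    # most_common() orders the modes from most to least frequent.
    return {mode: n / total for mode, n in counts.most_common()}

rates = failure_rates(labels)
for mode, rate in rates.items():
    print(f"{mode}: {rate:.0%}")
```

Dividing by the full sample size, not by the number of failing traces, keeps each rate readable as "share of sampled traces with this problem."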

Fix obvious issues before automating

Not every failure mode deserves an eval.

Some problems are simple engineering or prompt issues. If the assistant is missing a required sentence because the prompt never mentions it, fix the prompt. If JSON is malformed, add a schema check. If a tool error is being swallowed, surface the error and handle it.
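A code-only check of this kind can be very small. The sketch below uses only the Python standard library and assumes a hypothetical reply format with an `answer` string and a `citations` list; the field names are placeholders, not a real product schema.

```python
import json

# Hypothetical required fields for a structured assistant reply.
REQUIRED_FIELDS = {"answer": str, "citations": list}

def check_reply(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the reply passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for {name}")
    return problems

print(check_reply('{"answer": "Yes, we open at 9.", "citations": []}'))
print(check_reply('{"answer": "Yes, we open at 9."}'))
```

Checks like this are deterministic and cheap to run, which is why they belong in code rather than in a judge prompt.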

Automated evals are most useful for recurring, high-risk failures that are hard to catch with code alone.

For example:

  • Should this conversation have been handed off to a human?
  • Did the assistant make a promise the system cannot fulfill?
  • Did the answer resolve the user's actual task?
  • Did the agent use retrieved context faithfully?
  • Did the response follow the product's policy for a sensitive topic?

Those are narrow enough for an LLM judge, but still require judgment that a simple string check cannot provide.

Keep judge evals narrow and binary

LLM-as-judge evals fail when they are vague.

"Rate this response from 1 to 5" sounds convenient, but it often creates numbers that are hard to interpret and harder to trust. A 3.7 average does not tell the team what to ship.

Use binary questions wherever possible:

  • Did a handoff failure occur?
  • Did the model hallucinate an unsupported capability?
  • Did the response satisfy the user's scheduling request?
  • Did the assistant cite information not present in the retrieved context?
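As a rough sketch, the first question in the list above could be wired up as follows. The rubric wording is a placeholder, and `call_model` is a hypothetical function you would supply to send a prompt to whichever model you use; no specific provider API is assumed.

```python
from typing import Callable

# Placeholder rubric; a real one should come from the domain owner's notes.
HANDOFF_RUBRIC = (
    "You are reviewing one conversation between a user and an assistant.\n"
    "Question: Did a handoff failure occur?\n"
    "A handoff failure means the user needed a human and the assistant kept going.\n"
    "Answer with exactly one word: YES or NO.\n"
)

def parse_verdict(text: str) -> bool:
    """Map the judge's reply to a boolean, rejecting anything that is not YES or NO."""
    answer = text.strip().upper().rstrip(".")
    if answer == "YES":
        return True
    if answer == "NO":
        return False
    raise ValueError(f"judge did not give a binary answer: {text!r}")

def judge_handoff(transcript: str, call_model: Callable[[str], str]) -> bool:
    """Return True when the judge says a handoff failure occurred."""
    prompt = f"{HANDOFF_RUBRIC}\nConversation:\n{transcript}\n"
    return parse_verdict(call_model(prompt))

# Stand-in for a real model call so the sketch runs on its own.
def fake_model(prompt: str) -> str:
    return "yes."

print(judge_handoff("User: I need to talk to a person.\nAssistant: I can help!", fake_model))
```

Raising on anything other than YES or NO is a deliberate choice here: an ambiguous verdict is surfaced as an error instead of being silently counted as a pass.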

Binary evals force the team to define what good enough means. They also make it easier to compare the judge against human-labeled examples.

That last step is critical. Before trusting an LLM judge, run it against traces the PM or domain owner already labeled. Look at where the judge disagrees. If it misses failures the human caught, tighten the rubric. If it flags good interactions as bad, clarify the criteria.
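A minimal way to run that comparison, assuming the human labels and judge verdicts are aligned lists of booleans where `True` means "failure occurred" (the sample values are invented):

```python
def compare_judge_to_human(human: list[bool], judge: list[bool]) -> dict:
    """Summarize where a binary judge agrees and disagrees with human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge labels must be the same length")
    pairs = list(zip(human, judge))
    agree = sum(h == j for h, j in pairs)
    return {
        "agreement": agree / len(pairs) if pairs else 0.0,
        # Failures the human caught but the judge missed: tighten the rubric.
        "missed_failures": sum(h and not j for h, j in pairs),
        # Good interactions the judge flagged as bad: clarify the criteria.
        "false_alarms": sum(j and not h for h, j in pairs),
    }

human_labels = [True, True, False, False, True, False]
judge_labels = [True, False, False, True, True, False]

report = compare_judge_to_human(human_labels, judge_labels)
print(report)
```

Keeping missed failures and false alarms as separate counts matters because each points to a different fix, and a single agreement percentage would hide which one is the problem.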

Do not ship a judge because the prompt looks reasonable. Ship it because it matches your product judgment well enough to be useful.

Currai turns evals into an operating loop

Currai is built around the raw material this process needs: production traces.

When your AI application is instrumented with Currai, every important model call can be recorded as a generation, and every non-model step can be recorded as a span. Tool calls, retrieval, MCP connections, workflow steps, and errors all sit inside the same trace.

That gives PMs and engineers a shared view of what happened.

The loop becomes straightforward:

  1. Capture real user interactions as Currai traces.
  2. Review a focused sample each week.
  3. Write PM-owned notes on the first meaningful failure.
  4. Cluster those notes into recurring failure modes.
  5. Fix obvious product, prompt, or engineering issues.
  6. Turn persistent high-risk failures into automated evals.
  7. Run those evals on new production traces.
  8. Watch whether the failure rate goes down.

This is much more useful than arguing about whether a prompt "feels better." It connects product judgment to production evidence.

It also creates a shared language. Instead of saying "the assistant is bad at handoffs," the team can say "human handoff failures dropped from 18% to 6% after we changed the escalation policy and tool prompt."

That is an AI product team getting sharper.
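A before-and-after statement like that comes from grouping eval results by time period. The sketch below assumes each result is a pair of an ISO week string and a boolean judge verdict; the data and the week labels are made up, and nothing here reflects a Currai export format.

```python
from collections import defaultdict

# Hypothetical eval results: (week, did the judge flag a handoff failure?)
results = [
    ("2026-W20", True), ("2026-W20", False), ("2026-W20", True), ("2026-W20", False),
    ("2026-W21", False), ("2026-W21", True), ("2026-W21", False), ("2026-W21", False),
]

def weekly_failure_rate(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Failure rate per week, ordered chronologically by week label."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for week, failed in results:
        totals[week] += 1
        failures[week] += int(failed)
    return {week: failures[week] / totals[week] for week in sorted(totals)}

weekly = weekly_failure_rate(results)
for week, rate in weekly.items():
    print(f"{week}: {rate:.0%}")
```

With samples this small the rates are noisy, so a real comparison should use enough traces per week for the change to be meaningful.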

Evals are living PRDs

Traditional PRDs describe how the product should work before it is built. AI products still need that. But the real behavior of an AI system is discovered after users start interacting with it.

That makes evals a living extension of the PRD.

The trace review shows what users actually ask. The notes capture what the product should have done. The failure taxonomy turns judgment into categories. The automated evals turn those categories into ongoing measurement.

For AI products, requirements do not just live in a document. They live in the feedback loop between production behavior and product judgment.

PMs should own that loop because PMs own the product definition of "good."

Start small

You do not need a giant eval platform to begin. Start with one important workflow, one week of traces, and one PM or domain expert willing to read them carefully.

Look for the first meaningful failure. Write it down. Group the patterns. Count them. Fix what is obvious. Automate only what is worth monitoring.

The goal is not to do evals perfectly.

The goal is to make the product better every week.

If your team already has traces flowing through Currai, start with a recent slice of production traffic. If you are still setting up observability, read "Trace your first LLM call" or "Run LLM evals on production traces."
