Jun 30, 2026

Evals are a team sport

AI evals work best when product, support, engineering, and domain experts share traces, rubrics, and quality decisions in Currai.

Guide | 5 min read | The Currai team / Product

AI evals fail when they become a private engineering artifact. They also fail when they become a vague product wish list. The useful version sits in the middle: engineering makes the system measurable, and domain owners define what good means.

That makes evals a team sport.

Engineering owns the evidence

Engineers make sure the application emits useful traces. In Currai, that means capturing model generations, spans for retrieval and tools, prompt names, prompt versions, latency, token usage, cost, user IDs, session IDs, and relevant metadata.

Without that instrumentation, the team cannot explain why an answer failed. It can only argue about the final text.
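To make the list of captured fields concrete, here is a minimal sketch of what one trace record could look like. The `Trace` and `Span` classes, their field names, and the sample values are assumptions made for illustration. They are not Currai's actual schema or SDK.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """One step inside a trace, such as a retrieval call or a tool call."""
    name: str
    kind: str  # illustrative values: "retrieval", "tool", "generation"
    latency_ms: float
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Trace:
    """Hypothetical record of one request (not Currai's real schema)."""
    trace_id: str
    user_id: str
    session_id: str
    prompt_name: str
    prompt_version: int
    output_text: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    spans: list[Span] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)

    def total_latency_ms(self) -> float:
        """Sum the latency of every recorded span."""
        return sum(span.latency_ms for span in self.spans)

    def spans_of_kind(self, kind: str) -> list[Span]:
        """Return only the spans of one kind, e.g. all retrieval steps."""
        return [span for span in self.spans if span.kind == kind]


# Made-up example values, for illustration only.
trace = Trace(
    trace_id="t-001",
    user_id="u-42",
    session_id="s-7",
    prompt_name="refund-policy",
    prompt_version=12,
    output_text="You can request a refund from your orders page.",
    input_tokens=310,
    output_tokens=48,
    cost_usd=0.0021,
    spans=[
        Span(name="search_policy_docs", kind="retrieval", latency_ms=120.0),
        Span(name="draft_answer", kind="generation", latency_ms=830.0),
    ],
    metadata={"channel": "support-chat"},
)
```

With a record like this, a reviewer can ask which retrieval step fed a bad answer rather than debating the final text alone.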

Product owns the quality bar

Product managers decide which behaviors matter for the user and business. A response might be syntactically valid but still fail the product. It might be too generic, skip a required next step, or avoid a decision the workflow exists to make.

PMs turn those expectations into rubrics the team can run repeatedly.
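As a rough sketch, a rubric that can be run repeatedly might be a list of named criteria, each with a pass/fail check. Everything here is hypothetical: the `Criterion` class, the criterion names, and the 30-day refund window are invented for illustration. The keyword check is a stand-in for whatever scoring the team actually uses, such as human labels or a model-based judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One rubric line: a name, a plain-language description, and a check."""
    name: str
    description: str
    check: Callable[[str], bool]


def mentions_any(phrases: list[str]) -> Callable[[str], bool]:
    """Build a check that passes if the text contains any of the phrases."""
    lowered = [phrase.lower() for phrase in phrases]

    def check(text: str) -> bool:
        haystack = text.lower()
        return any(phrase in haystack for phrase in lowered)

    return check


# Hypothetical rubric; the policy details are made up for this example.
REFUND_RUBRIC = [
    Criterion(
        name="states_refund_window",
        description="The answer names the refund window.",
        check=mentions_any(["30 days", "30-day"]),
    ),
    Criterion(
        name="gives_next_step",
        description="The answer tells the customer what to do next.",
        check=mentions_any(["start a return", "contact support"]),
    ),
]


def score(text: str, rubric: list[Criterion]) -> dict[str, bool]:
    """Apply every criterion to one response and report pass/fail per name."""
    return {criterion.name: criterion.check(text) for criterion in rubric}


good = score(
    "You can get a refund within 30 days. To begin, start a return "
    "from your orders page.",
    REFUND_RUBRIC,
)
generic = score("Refunds are sometimes possible.", REFUND_RUBRIC)
```

The second response is well-formed but fails both criteria, which is the "too generic" failure described above.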

Support owns policy reality

Support teams know where AI answers break down in production. They see refund edge cases, escalation rules, confusing account states, and the phrases customers actually use.

When support leaders review traces, they can identify failures a generic judge might miss. Their labels and notes make evals sharper.

Domain experts own risk

In legal, healthcare, finance, education, and other specialized products, domain experts need a direct role in eval design. They define unacceptable advice, required caveats, groundedness standards, and escalation thresholds.

Currai helps by giving them concrete traces to review instead of abstract prompt text.

The shared workflow

The operating loop is simple:

  1. Engineering instruments the product with Currai.
  2. Product and support review production traces.
  3. The team identifies recurring failures.
  4. Domain owners define rubrics.
  5. Engineering and product ship changes.
  6. Currai evals measure whether quality improved.

This creates a shared language. Instead of "the model feels worse," the team can say "refund-policy completeness dropped for prompt version 12 on recent support traffic."

That is a decision-ready signal.
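A signal like that comes from grouping rubric results by prompt version. The sketch below assumes each result has already been reduced to a `(prompt_version, passed)` pair. The function name and the sample data are made up for illustration and do not represent Currai's API or real measurements.

```python
from collections import defaultdict
from typing import Iterable


def pass_rate_by_version(
    results: Iterable[tuple[int, bool]],
) -> dict[int, float]:
    """Compute the share of passing results for each prompt version."""
    counts: dict[int, list[int]] = defaultdict(lambda: [0, 0])  # [passed, total]
    for version, passed in results:
        counts[version][0] += int(passed)
        counts[version][1] += 1
    return {
        version: passed / total
        for version, (passed, total) in sorted(counts.items())
    }


# Invented results for one criterion, e.g. refund-policy completeness.
results = [
    (11, True), (11, True), (11, True), (11, False),
    (12, True), (12, False), (12, False), (12, False),
]

rates = pass_rate_by_version(results)
for version, rate in rates.items():
    print(f"prompt version {version}: {rate:.0%} pass")
```

Comparing versions side by side turns "the model feels worse" into a number the whole team can discuss.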

Related: LLM observability, LLM evals, and Evals for PMs.
