Jun 30, 2026

How to improve golden datasets with human review in Currai

Golden datasets get better when they come from real failures. Currai helps teams turn production traces and human review into durable eval cases.

Guide | 6 min read | The Currai team / Product

Golden datasets are useful, but many teams build them too early. They invent clean examples, write expected answers, and then discover that production users behave nothing like the spreadsheet.

A stronger golden dataset starts from real failures.

Start with traces, not imagination

Production traces show the inputs your users actually send and the outputs your system actually returns. They include prompt versions, model calls, tool results, retrieval context, latency, cost, and session metadata.

That makes them better raw material than synthetic examples. A trace captures the messy wording, missing context, follow-up questions, and workflow state that make AI products hard to evaluate.

Currai lets teams inspect those traces and choose the ones that represent important quality risks.
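To make the raw material concrete, here is a minimal sketch of what a trace record and a first-pass filter might look like. The field names, the `Trace` class, the `review_candidates` helper, and the tool-call threshold are all hypothetical and chosen for illustration; they are not Currai's schema or API. The filter only narrows what a person looks at. It does not decide what belongs in the dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One production interaction. Field names are illustrative assumptions."""
    trace_id: str
    session_id: str
    prompt_version: str
    user_input: str
    output: str
    retrieval_context: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)  # tool names, in call order
    latency_ms: float = 0.0
    cost_usd: float = 0.0


def review_candidates(traces: list[Trace], max_tool_calls: int = 5) -> list[Trace]:
    """Return traces whose tool-call count exceeds an (arbitrary) threshold."""
    return [t for t in traces if len(t.tool_calls) > max_tool_calls]


traces = [
    Trace("tr-1", "s-1", "v12", "where is my refund", "It is on the way.",
          retrieval_context=["Refunds take 5 to 7 business days."],
          tool_calls=["lookup_order"], latency_ms=840.0, cost_usd=0.002),
    Trace("tr-2", "s-2", "v12", "cancel it", "Which order do you mean?",
          tool_calls=["search_orders"] * 8, latency_ms=9200.0, cost_usd=0.031),
]

flagged = review_candidates(traces)
print([t.trace_id for t in flagged])
```

A count of tool calls is only one possible signal. Latency, cost, or user feedback could feed the same kind of shortlist.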

Use human review to define the answer

Human review is where a trace becomes a golden case. A domain owner can decide what should have happened:

  • Was the answer correct?
  • Did it follow policy?
  • Did it use the available context?
  • Should it have escalated?
  • What specific detail was missing?

The human label does not need to be elaborate. Often the most useful review is a short note and a pass/fail judgment tied to a clear rubric.
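As a rough sketch of how small that label can be, the example below pairs a traced input and output with a reviewer's pass/fail judgment, a rubric identifier, and a short note. The `Review` and `GoldenCase` classes, the `to_golden_case` helper, and the rule that a failing review must carry a note are assumptions made for illustration, not a Currai data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """A human judgment on one trace (hypothetical shape)."""
    trace_id: str
    reviewer: str
    rubric_id: str
    passed: bool
    note: str


@dataclass(frozen=True)
class GoldenCase:
    """A reviewed trace kept as an eval case (hypothetical shape)."""
    trace_id: str
    user_input: str
    observed_output: str
    rubric_id: str
    passed: bool
    note: str


def to_golden_case(user_input: str, observed_output: str, review: Review) -> GoldenCase:
    """Pair the traced input and output with the reviewer's judgment."""
    # Assumed rule: a failure without an explanation is not useful as a case.
    if not review.passed and not review.note.strip():
        raise ValueError("a failing review needs a note saying what went wrong")
    return GoldenCase(review.trace_id, user_input, observed_output,
                      review.rubric_id, review.passed, review.note.strip())


review = Review("tr-1", "support-lead", "refund-policy", False,
                "Should have quoted the refund window from the retrieved policy.")
case = to_golden_case("where is my refund", "It is on the way.", review)
print(case.passed, "-", case.note)
```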

Keep the dataset focused

A golden dataset should not be a junk drawer of every bad interaction. Keep the set focused on failures you want to prevent from recurring.

Good candidates include:

  • policy mistakes that create support or compliance risk
  • hallucinations from missing or ignored retrieval context
  • agent loops or unnecessary tool calls
  • multi-turn context failures
  • high-volume user intents with inconsistent answers

Each case should earn its place by representing a real product risk.
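One way to enforce that rule is to require a named risk on every case and to check how the cases are spread across risks. The sketch below assumes a `Risk` enum that mirrors the candidate list above and a hypothetical `coverage` helper; neither is part of Currai.

```python
from collections import Counter
from enum import Enum


class Risk(Enum):
    """Risk categories taken from the candidate list above."""
    POLICY_MISTAKE = "policy mistake"
    HALLUCINATION = "hallucination from missing or ignored context"
    AGENT_LOOP = "agent loop or unnecessary tool call"
    MULTI_TURN = "multi-turn context failure"
    INCONSISTENT_INTENT = "inconsistent answer on a high-volume intent"


def coverage(cases: list[tuple[str, Risk]]) -> dict[Risk, int]:
    """Count golden cases per risk, including risks with no cases yet."""
    counts = Counter(risk for _, risk in cases)
    return {risk: counts[risk] for risk in Risk}


# Each case is (trace_id, risk). The trace ids are made up for the example.
cases = [
    ("tr-1", Risk.POLICY_MISTAKE),
    ("tr-2", Risk.AGENT_LOOP),
    ("tr-7", Risk.POLICY_MISTAKE),
]

for risk, n in coverage(cases).items():
    print(f"{n:>2}  {risk.value}")
```

A category with zero cases is not necessarily a gap. It may simply mean that failure has not shown up in production yet.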

Refresh it continuously

Golden datasets go stale when products, prompts, policies, tools, and users change. Currai helps keep the dataset fresh by making production traces easy to review after every prompt change, model upgrade, or incident.

When a new recurring failure appears, add a representative trace. When a case no longer matters, remove it. When a rubric is ambiguous, tighten it with examples.

The dataset should become a living record of what the team has learned.

Combine human labels with evals

Human review gives the ground truth. Evals help scale it.

A good workflow is to label a small set of traces, write a narrow rubric, run a judge eval, and compare the judge against the human labels. Where the judge disagrees with a human label, inspect the trace and improve the rubric.
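The comparison step can be sketched in a few lines. The example assumes both the human reviewer and the judge produce a pass/fail verdict keyed by trace id; the `Comparison` class, the `compare_judge_to_humans` function, and the sample labels are hypothetical and stand in for whatever the eval tooling actually exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    agreement: float          # share of shared traces where the verdicts match
    false_passes: list[str]   # judge passed, human failed
    false_fails: list[str]    # judge failed, human passed


def compare_judge_to_humans(human: dict[str, bool], judge: dict[str, bool]) -> Comparison:
    """Measure the judge against human labels on traces that both have scored."""
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no traces labeled by both the human reviewer and the judge")
    false_passes = [t for t in shared if judge[t] and not human[t]]
    false_fails = [t for t in shared if not judge[t] and human[t]]
    agreed = len(shared) - len(false_passes) - len(false_fails)
    return Comparison(agreed / len(shared), false_passes, false_fails)


# Made-up labels: True means the trace passed the rubric.
human_labels = {"t1": True, "t2": False, "t3": False, "t4": True}
judge_labels = {"t1": True, "t2": True, "t3": False, "t4": True}

report = compare_judge_to_humans(human_labels, judge_labels)
print(f"agreement: {report.agreement:.2f}")
print("judge passed, human failed:", report.false_passes)
print("judge failed, human passed:", report.false_fails)
```

Splitting disagreements by direction is a design choice: a judge that passes what a human failed lets regressions through, while one that fails what a human passed creates noise, and the two usually call for different rubric fixes.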

That is how teams build trust in evals: not by assuming the judge is right, but by measuring it against product judgment.

Related: LLM evals, Run LLM evals on production traces, and Evals are a team sport.
