Jun 30, 2026

How to improve golden datasets with human review in Currai

Golden datasets get better when they come from real failures. Currai helps teams turn production traces and human review into durable eval cases.

GUIDE | 6 min read | The Currai team / Product

Golden datasets are useful, but many teams build them too early. They invent clean examples, write expected answers, and then discover that production users behave nothing like the spreadsheet.

A stronger golden dataset starts from real failures.

Start with traces, not imagination

Production traces show the inputs your users actually send and the outputs your system actually returns. They include prompt versions, model calls, tool results, retrieval context, latency, cost, and session metadata.

That makes them better raw material than synthetic examples. A trace captures the messy wording, missing context, follow-up questions, and workflow state that make AI products hard to evaluate.

Currai lets teams inspect those traces and choose the ones that represent important quality risks.
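
To make the idea concrete, here is a minimal sketch of what a trace record and a review shortlist might look like. The `Trace` fields, the `shortlist` helper, and the two selection signals are assumptions made for illustration; they are not Currai's schema or API, and the sample traces are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    # Hypothetical record; field names are illustrative, not Currai's schema.
    trace_id: str
    user_input: str
    output: str
    prompt_version: str
    tool_results: list[str] = field(default_factory=list)
    retrieval_context: list[str] = field(default_factory=list)
    latency_ms: float = 0.0
    cost_usd: float = 0.0


def shortlist(traces: list[Trace], max_tool_calls: int = 5) -> list[Trace]:
    """Return traces that deserve a human look.

    Two assumed signals: the answer was produced with no retrieval context,
    or the run made more tool calls than max_tool_calls (a possible loop).
    """
    return [
        t for t in traces
        if not t.retrieval_context or len(t.tool_results) > max_tool_calls
    ]


# Invented sample traces, for illustration only.
traces = [
    Trace("t1", "Can I get a refund?", "Yes, within 30 days.", "v3",
          retrieval_context=["Refund policy: 30 days"]),
    Trace("t2", "Cancel my plan", "Done.", "v3"),  # no retrieval context
    Trace("t3", "Where is my order?", "Still checking.", "v3",
          tool_results=["lookup"] * 8, retrieval_context=["Order FAQ"]),
]

shortlisted_ids = [t.trace_id for t in shortlist(traces)]
print(shortlisted_ids)
```

A heuristic like this only narrows the pile. A person still decides which shortlisted traces represent a risk worth keeping.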

Use human review to define the answer

Human review is where a trace becomes a golden case. A domain owner can decide what should have happened:

Was the answer correct?
Did it follow policy?
Did it use the available context?
Should it have escalated?
What specific detail was missing?

The human label does not need to be elaborate. Often the most useful review is a short note and a pass/fail judgment tied to a clear rubric.
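
As a rough sketch of how small that label can be, the snippet below pairs a trace with a pass/fail judgment, a short note, and the rubric it was judged against. `Review`, `GoldenCase`, and `to_golden_case` are hypothetical names chosen for this example, not part of any Currai interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    passed: bool   # pass/fail judgment from the domain owner
    note: str      # short explanation, e.g. the detail that was missing
    rubric: str    # the rule the judgment was made against


@dataclass(frozen=True)
class GoldenCase:
    trace_id: str
    user_input: str
    observed_output: str
    review: Review


def to_golden_case(trace_id: str, user_input: str, observed_output: str,
                   passed: bool, note: str, rubric: str) -> GoldenCase:
    """Turn a reviewed trace into a golden case (hypothetical helper)."""
    # A label without a rubric cannot be re-checked later, so reject it.
    if not rubric.strip():
        raise ValueError("a review needs a rubric")
    return GoldenCase(trace_id, user_input, observed_output,
                      Review(passed, note, rubric))


# Invented example for illustration.
case = to_golden_case(
    trace_id="t2",
    user_input="Cancel my plan",
    observed_output="Done.",
    passed=False,
    note="Should have confirmed the account before cancelling.",
    rubric="Account changes require identity confirmation first.",
)
print(case.review.passed, "-", case.review.note)
```

Freezing the dataclasses is a design choice for this sketch: once a reviewer has signed off on a case, changes should go through a new review rather than a silent edit.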

Keep the dataset focused

A golden dataset should not be a junk drawer of every bad interaction. Keep the set focused on failures you want to prevent from recurring.

Good candidates include:

policy mistakes that create support or compliance risk
hallucinations from missing or ignored retrieval context
agent loops or unnecessary tool calls
multi-turn context failures
high-volume user intents with inconsistent answers

Each case should earn its place by representing a real product risk.

Refresh it continuously

Golden datasets go stale as products, prompts, policies, tools, and user behavior change. Currai helps keep the dataset fresh by making production traces easy to review after every prompt change, model upgrade, or incident.

When a new recurring failure appears, add a representative trace. When a case no longer matters, remove it. When a rubric is ambiguous, tighten it with examples.
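
Those three maintenance moves can be sketched as operations on a small in-memory collection. The `GoldenDataset` class and its `add`, `retire`, and `tighten` methods are assumptions for illustration only; a real dataset would live in whatever store your team uses.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    trace_id: str
    risk: str      # the product risk this case stands for
    rubric: str
    examples: list[str] = field(default_factory=list)


class GoldenDataset:
    """Hypothetical in-memory golden dataset keyed by trace id."""

    def __init__(self) -> None:
        self._cases: dict[str, Case] = {}

    def add(self, case: Case) -> None:
        # A new recurring failure: keep one representative trace.
        self._cases[case.trace_id] = case

    def retire(self, trace_id: str) -> None:
        # The case no longer matters; removing a missing id is a no-op.
        self._cases.pop(trace_id, None)

    def tighten(self, trace_id: str, rubric: str, example: str) -> None:
        # An ambiguous rubric gets sharper wording plus a concrete example.
        case = self._cases[trace_id]
        case.rubric = rubric
        case.examples.append(example)

    def risks(self) -> list[str]:
        return sorted({c.risk for c in self._cases.values()})

    def __len__(self) -> int:
        return len(self._cases)


# Invented cases for illustration.
dataset = GoldenDataset()
dataset.add(Case("t2", "policy", "Confirm identity before account changes."))
dataset.add(Case("t3", "agent loop", "Stop after a failed lookup."))
dataset.retire("t3")
dataset.tighten(
    "t2",
    rubric="Confirm identity before cancelling, pausing, or downgrading.",
    example="User asks to pause billing: must confirm identity first.",
)
print(len(dataset), dataset.risks())
```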

The dataset should become a living record of what the team has learned.

Combine human labels with evals

Human review gives the ground truth. Evals help scale it.

A good workflow is to label a small set of traces, write a narrow rubric, run a judge eval on the same traces, and compare the judge's verdicts against the human labels. Where the two disagree, inspect the trace and improve the rubric.
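
The comparison step can be as simple as an agreement rate plus a list of traces to inspect. The sketch below assumes both the human labels and the judge verdicts are pass/fail values keyed by trace id; the `compare` function and the sample labels are hypothetical.

```python
def compare(human: dict[str, bool],
            judge: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (agreement rate, trace ids where judge and human disagree).

    Only traces that have both a human label and a judge verdict are counted.
    """
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no traces have both a human label and a judge verdict")
    disagreements = [tid for tid in shared if human[tid] != judge[tid]]
    agreement = 1 - len(disagreements) / len(shared)
    return agreement, disagreements


# Invented labels for illustration: True means pass, False means fail.
human_labels = {"t1": True, "t2": False, "t3": False, "t4": True}
judge_verdicts = {"t1": True, "t2": True, "t3": False, "t4": True}

agreement, to_inspect = compare(human_labels, judge_verdicts)
print(f"agreement: {agreement:.0%}, inspect: {to_inspect}")
```

The list of disagreements matters more than the rate itself: each entry is a trace to open and read before changing the rubric.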

That is how teams build trust in evals: not by assuming the judge is right, but by measuring it against product judgment.
