
Run LLM evals on production traces

The Currai team, Engineering · Jun 15, 2026

Most eval pipelines start with good intentions and bad data. You collect twenty "representative" prompts in a spreadsheet, score a new prompt against them, ship the winner, and a week later production is full of inputs your spreadsheet never saw.

The problem is not that offline evals are useless. The problem is that your best test set keeps moving. Real users ask messier questions, longer questions, and more varied questions than the neat set you assembled on day one. If you want to know whether a prompt change actually helped, the highest-signal dataset is the traffic you already traced in production.

Observability first, evals second

An eval only has something to judge if you recorded what happened. That means the prompt input, the model output, and enough metadata to tell versions apart later. In Currai, that starts with a normal trace and a normal generation:

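// Fetch the managed prompt. When an A/B test is live, this resolves
// the version that will serve the request.
const prompt = await currai.getPrompt("bmi-intake");

// One trace per chat turn. sessionId and userId come from your application.
const trace = currai.trace({
  name: "chat-turn",
  sessionId,
  userId,
});

// Record the model call and tag it with the prompt name and version behind it.
const gen = trace.generation({
  name: "openai.chat.completions",
  model: "gpt-4o-mini",
  input: prompt.compile({ weight: "70kg", height: "180cm" }),
  promptName: prompt.name,
  promptVersion: prompt.version,
  metadata: { selectedVariant: prompt.selectedVariant },
});

// Call this once the model has answered. reply is the response text from
// your provider call; the token counts here are illustrative values.
gen.end({
  output: reply,
  usage: { input: 312, output: 94, total: 406, unit: "TOKENS" },
});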

Two of those fields, promptName and promptVersion, are what turn a pile of generations into an eval-ready dataset. The extra selectedVariant metadata is useful when an experiment is live, because the trace records which arm actually served the request. Later, Currai can pull the traces for one prompt, group the judged results by version, and tell you whether v4 actually beat v3.
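
To make the grouping step concrete, here is a minimal sketch of averaging judge scores per prompt version. The JudgedTrace shape and the averageScoreByVersion helper are assumptions made for this example; they are not Currai's export format or SDK.

```typescript
// Hypothetical shape for a judged trace; not Currai's actual schema.
interface JudgedTrace {
  promptName: string;
  promptVersion: number;
  score: number; // judge score on a 1 to 5 scale
}

// Average judge score per prompt version, restricted to one prompt.
function averageScoreByVersion(
  traces: JudgedTrace[],
  promptName: string
): Record<number, number> {
  const totals: Record<number, { sum: number; count: number }> = {};
  for (const t of traces) {
    if (t.promptName !== promptName) continue; // ignore other prompts
    let bucket = totals[t.promptVersion];
    if (!bucket) {
      bucket = { sum: 0, count: 0 };
      totals[t.promptVersion] = bucket;
    }
    bucket.sum += t.score;
    bucket.count += 1;
  }

  const averages: Record<number, number> = {};
  for (const key of Object.keys(totals)) {
    const version = Number(key);
    const bucket = totals[version];
    if (bucket) averages[version] = bucket.sum / bucket.count;
  }
  return averages;
}

// Illustrative data: v3 averages 3.5 and v4 averages 5.
const judgedTraces: JudgedTrace[] = [
  { promptName: "bmi-intake", promptVersion: 3, score: 3 },
  { promptName: "bmi-intake", promptVersion: 3, score: 4 },
  { promptName: "bmi-intake", promptVersion: 4, score: 5 },
  { promptName: "other-prompt", promptVersion: 4, score: 1 },
];
console.log(averageScoreByVersion(judgedTraces, "bmi-intake"));
```

Without promptVersion on each generation there is nothing to key the buckets on, which is why the tagging has to happen at trace time.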

If you are running an active A/B test, the same idea applies. getPrompt resolves the version that served the request, and the generation carries that version into the trace. Now your eval is judging the exact outputs each variant produced in production, not a synthetic reconstruction of what you think users saw.

Why production traces beat static eval sets

A static eval set answers one narrow question: "How does this change behave on the cases I remembered to write down?" A traced production dataset answers the more useful one: "How did this change behave on the traffic I actually served?"

That matters for three reasons:

  • The distribution is real. The long tail of awkward, ambiguous, repetitive, or underspecified inputs shows up automatically.
  • The comparison is fair. When every trace records the version that served it, you can compare quality, latency, and cost on the exact same product surface.
  • The loop is short. You do not need to export logs, build a labeling pipeline, and hand-stitch versions back onto responses before learning something useful.

Production traces do not replace deliberate benchmark sets. They stop benchmark sets from being your only source of truth.

What an eval run looks like in Currai

In the dashboard this lives under Playgrounds: pick a prompt or A/B test, choose traces, define rubrics, and find the winning version.

Currai's eval workflow is built around the traces you already captured for a prompt.

  1. Pick a prompt to evaluate.
  2. Optionally scope the run to an A/B test if you want results summarized by variant label instead of just prompt version.
  3. Select the traces to judge from the observed outputs for that prompt.
  4. Define at least two rubrics such as relevance, accuracy, format adherence, or policy compliance.
  5. Choose a provider and a judge model.
  6. Run the eval and compare average scores by version or variant.
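
The steps above map naturally onto a small configuration object. The sketch below is one way to model and sanity-check such a run before it starts. Every name in it (EvalRunConfig, abTestId, validateEvalRunConfig) is hypothetical and chosen for illustration; in Currai itself the run is set up in the dashboard.

```typescript
// Hypothetical config shape mirroring the six steps; not Currai's API.
interface Rubric {
  id: string;
  description: string;
}

interface EvalRunConfig {
  promptName: string;
  abTestId?: string; // optional: summarize by variant label
  traceIds: string[];
  rubrics: Rubric[];
  judge: { provider: string; model: string };
}

// Return a list of problems; an empty list means the run can start.
function validateEvalRunConfig(config: EvalRunConfig): string[] {
  const problems: string[] = [];
  if (config.promptName.trim() === "") {
    problems.push("pick a prompt to evaluate");
  }
  if (config.traceIds.length === 0) {
    problems.push("select at least one trace");
  }
  if (config.rubrics.length < 2) {
    problems.push("define at least two rubrics");
  }
  const seen: Record<string, boolean> = {};
  for (const rubric of config.rubrics) {
    if (seen[rubric.id]) problems.push(`duplicate rubric id: ${rubric.id}`);
    seen[rubric.id] = true;
  }
  if (config.judge.provider.trim() === "" || config.judge.model.trim() === "") {
    problems.push("choose a provider and a judge model");
  }
  return problems;
}

// Illustrative run with placeholder identifiers.
const exampleRun: EvalRunConfig = {
  promptName: "bmi-intake",
  traceIds: ["trace-1", "trace-2"],
  rubrics: [
    { id: "relevance", description: "Answers the question that was asked." },
    { id: "format", description: "Follows the required output format." },
  ],
  judge: { provider: "openai", model: "gpt-4o-mini" },
};
console.log(validateEvalRunConfig(exampleRun)); // no problems for this config
```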

The important detail is that Currai is not judging an abstract prompt definition. It is judging the actual input/output pairs that already happened. The judge sees what the model saw and what the model said, then scores each response rubric by rubric on a 1 to 5 scale with a short justification.
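
If you want to reason about what a per-rubric verdict looks like as data, the following sketch parses and validates one. It assumes the judge was asked to reply with a JSON array containing one object per rubric; that reply format, the RubricScore type, and parseJudgeVerdict are assumptions for this example, not a description of how Currai talks to the judge model.

```typescript
// Hypothetical verdict entry: one per rubric.
interface RubricScore {
  rubricId: string;
  score: number; // whole number from 1 to 5
  justification: string;
}

// Parse a judge reply and check that every rubric has a valid score
// and a non-empty justification. Throws on anything malformed.
function parseJudgeVerdict(raw: string, rubricIds: string[]): RubricScore[] {
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) {
    throw new Error("judge reply must be a JSON array");
  }
  const entries: any[] = parsed;

  return rubricIds.map((rubricId) => {
    const entry = entries.filter(
      (e) => e !== null && typeof e === "object" && e.rubricId === rubricId
    )[0];
    if (!entry) {
      throw new Error(`missing score for rubric "${rubricId}"`);
    }
    const score = entry.score;
    if (
      typeof score !== "number" ||
      Math.floor(score) !== score ||
      score < 1 ||
      score > 5
    ) {
      throw new Error(`rubric "${rubricId}" needs a whole score from 1 to 5`);
    }
    if (typeof entry.justification !== "string" || entry.justification.trim() === "") {
      throw new Error(`rubric "${rubricId}" needs a justification`);
    }
    return { rubricId, score, justification: entry.justification.trim() };
  });
}
```

Rejecting malformed verdicts up front keeps a single bad judge reply from silently skewing a version's average.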

The decisions the scores unlock

The most useful eval result is not a global number. It is a concrete decision.

If one prompt version wins on average score without blowing up cost or latency, promote it. If the new version loses, move the label back and keep iterating. If the results are too close to call, tighten the rubrics or run the comparison on a fresh slice of traffic.
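
Written down as code, that decision rule might look like the sketch below. The thresholds (a 0.2-point score gap and a 10% allowance for latency and cost) are arbitrary assumptions picked for illustration, as are the VersionStats shape and the decide function. A candidate that wins on score but exceeds the cost or latency budget is treated here as inconclusive, which is also an assumption.

```typescript
// Hypothetical per-version summary; not a Currai type.
interface VersionStats {
  label: string;
  avgScore: number; // mean judge score, 1 to 5
  avgLatencyMs: number;
  avgCostUsd: number;
}

type Decision = "promote" | "roll_back" | "inconclusive";

interface Thresholds {
  minScoreGap: number; // score difference needed to call a winner
  maxRegression: number; // allowed relative increase in latency or cost
}

// Assumed defaults; tune them for your own product.
const DEFAULT_THRESHOLDS: Thresholds = { minScoreGap: 0.2, maxRegression: 0.1 };

function decide(
  current: VersionStats,
  candidate: VersionStats,
  thresholds: Thresholds = DEFAULT_THRESHOLDS
): Decision {
  const gap = candidate.avgScore - current.avgScore;
  if (gap <= -thresholds.minScoreGap) return "roll_back"; // candidate clearly loses
  if (gap < thresholds.minScoreGap) return "inconclusive"; // too close to call

  // Candidate wins on score; make sure it did not blow up latency or cost.
  const limit = 1 + thresholds.maxRegression;
  const latencyOk = candidate.avgLatencyMs <= current.avgLatencyMs * limit;
  const costOk = candidate.avgCostUsd <= current.avgCostUsd * limit;
  return latencyOk && costOk ? "promote" : "inconclusive";
}

// Illustrative numbers only.
const v3: VersionStats = { label: "v3", avgScore: 3.8, avgLatencyMs: 800, avgCostUsd: 0.002 };
const v4: VersionStats = { label: "v4", avgScore: 4.3, avgLatencyMs: 820, avgCostUsd: 0.0021 };
console.log(decide(v3, v4)); // "promote"
```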

This is especially useful with prompt experiments. Instead of arguing about which wording "feels clearer," you can split real traffic, run an eval over the traces each arm generated, and see a winner. Currai's run summary already aggregates results by prompt version, and when an A/B test is in scope it summarizes by variant label too, marking the current winner in the comparison table.

That turns prompt iteration into an operational workflow:

  • Ship a new version to a fraction of traffic.
  • Let traces accumulate.
  • Judge the resulting outputs against explicit rubrics.
  • Promote, roll back, or rewrite based on measured behavior.
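
For intuition about the first step, here is a generic sketch of sticky, percentage-based assignment: hash a stable key into one of 100 buckets and send the lowest buckets to the candidate. In Currai the version that serves a request is resolved by getPrompt, so you do not need to write this yourself; bucketFor, assignArm, and the experiment id are hypothetical names used only to illustrate the idea.

```typescript
// Map a string to a stable bucket in [0, 100) with a simple polynomial hash.
function bucketFor(key: string): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    // ">>> 0" keeps the running value an unsigned 32-bit integer.
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

type Arm = "control" | "candidate";

// The same user always lands in the same arm for a given experiment.
function assignArm(
  userId: string,
  experimentId: string,
  candidatePercent: number
): Arm {
  if (candidatePercent < 0 || candidatePercent > 100) {
    throw new RangeError("candidatePercent must be between 0 and 100");
  }
  const bucket = bucketFor(`${experimentId}:${userId}`);
  return bucket < candidatePercent ? "candidate" : "control";
}

// Send roughly 10% of users to the candidate version.
console.log(assignArm("user-42", "bmi-intake-v4", 10));
```

Keying on the user rather than the request keeps one person's conversation on a single arm, which makes the resulting traces easier to compare.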

Start with one prompt, not an eval platform

You do not need a giant eval taxonomy to get value here. Start with one prompt whose quality matters. Make sure its generations carry promptName and promptVersion, plus metadata.selectedVariant if you are testing variants. Then run one eval on recent traces with two or three rubrics, each of which you can explain to another engineer in one sentence.
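
Before running that first eval, it can help to check how many of your recent generations actually carry those fields. The sketch below counts them over a list of records. The GenerationRecord shape is an assumption modeled on the fields named above, including the guess that promptVersion is a number, and checkEvalReadiness is a hypothetical helper rather than part of Currai.

```typescript
// Hypothetical record shape based on the fields discussed in this post.
interface GenerationRecord {
  promptName?: string;
  promptVersion?: number;
  metadata?: { selectedVariant?: string };
}

interface ReadinessReport {
  total: number;
  ready: number;
  missingPromptName: number;
  missingPromptVersion: number;
  missingVariant: number;
}

// Count how many generations are tagged well enough to be judged by version.
function checkEvalReadiness(
  generations: GenerationRecord[],
  variantRequired: boolean
): ReadinessReport {
  const report: ReadinessReport = {
    total: generations.length,
    ready: 0,
    missingPromptName: 0,
    missingPromptVersion: 0,
    missingVariant: 0,
  };
  for (const g of generations) {
    const hasName = typeof g.promptName === "string" && g.promptName !== "";
    const hasVersion = typeof g.promptVersion === "number";
    const hasVariant = !!(g.metadata && g.metadata.selectedVariant);

    if (!hasName) report.missingPromptName += 1;
    if (!hasVersion) report.missingPromptVersion += 1;
    if (variantRequired && !hasVariant) report.missingVariant += 1;
    if (hasName && hasVersion && (!variantRequired || hasVariant)) {
      report.ready += 1;
    }
  }
  return report;
}
```

A low ready count usually means some call sites still skip the prompt fields, and that is worth fixing before reading anything into the scores.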

The first thing you learn is usually not that the new version is amazing. It is that a change you were ready to ship confidently is mixed, brittle, or only good for one slice of traffic. That is exactly the point. Evals are not there to confirm your instinct. They are there to stop your instinct from being the only thing you have.

If you have not instrumented the prompt versions yet, start with "Prompts & A/B testing". If you are still getting traces flowing, read "What is LLM observability?" first. The workflow is: capture the real run, then judge it.