GUIDE · 6 min read

Why you should A/B test your LLM prompts

The Currai team, Engineering · Jun 7, 2026

You change "Be helpful" to "Be concise," it reads better in your editor, you ship it. A week later your token bill is down 15% — or your support tickets are up, because "concise" quietly dropped a step users relied on. You can't tell which, because nothing measured the change. Prompt edits feel small, but they're code changes to the most behaviorally sensitive part of your app, and "looks better to me" is not a measurement.

The problem with "looks better to me"

Prompts are deceptively easy to edit and deceptively hard to evaluate. Three things make eyeballing unreliable:

  • Small wording changes have outsized effects. Reordering instructions, adding a single example, or tightening a sentence can move output format, refusal rate, latency, and cost — often in directions you didn't intend.
  • Your eval set drifts from real traffic. A handful of test prompts can look great while the long tail of real user inputs regresses. The only fully representative test set is production.
  • Quality and cost trade off against each other. A longer prompt might raise answer quality by 3% and raise cost by 40%. Whether that's worth it is a business call you can only make with both numbers in front of you.

The result is that most prompt iteration is a series of confident guesses, each shipped to 100% of users with no way to compare it against what it replaced.

What A/B testing a prompt actually means

An A/B test keeps two (or more) versions of the same prompt live at once and splits real traffic between them by weight. Version A — your current production wording — keeps serving most requests; version B gets a slice. Every request is tagged with the version that served it, so when you look at cost, latency, and quality you can group by version and compare like for like.

That's the whole idea: instead of replacing A with B and hoping, you run them against the same live distribution of users at the same time, and let the numbers decide.
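
If "splits real traffic by weight" feels abstract, here is a minimal sketch of a weighted pick. It is an illustration of the general technique, not Currai's implementation: the Variant shape, the pickVariant helper, and the 90/10 weights are all assumptions made up for this example.

```typescript
// Hypothetical shape for illustration; not Currai's actual types.
interface Variant {
  label: string;
  weight: number;
}

// Pick a variant with probability proportional to its weight.
// `rand` is injectable so the split can be checked deterministically.
function pickVariant(
  variants: Variant[],
  rand: () => number = Math.random,
): Variant {
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  if (variants.length === 0 || total <= 0) {
    throw new Error("need at least one variant with positive weight");
  }
  let threshold = rand() * total;
  for (const v of variants) {
    threshold -= v.weight;
    if (threshold < 0) return v;
  }
  // Guard against floating-point edge cases.
  return variants[variants.length - 1];
}

// Assumed weights: A keeps most traffic, B gets a slice.
const arms: Variant[] = [
  { label: "production", weight: 90 },
  { label: "concise", weight: 10 },
];

// Tally 10,000 simulated requests by the variant that served them.
const counts: Record<string, number> = {};
for (let i = 0; i < 10_000; i++) {
  const { label } = pickVariant(arms);
  counts[label] = (counts[label] ?? 0) + 1;
}
console.log(counts); // roughly a 90/10 split; exact counts vary per run
```

A pick like this is made per request. If you need the same user to always see the same arm, you would typically derive the random value from a hash of a stable user ID instead of Math.random.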

Why it's worth the setup

You catch regressions before they're everywhere. Roll a new wording out to 10% of traffic. If error rate, latency, or cost moves the wrong way, you've exposed 10% of users to it, not all of them — and you roll back by moving a label, not by shipping a revert.

You quantify the trade-off instead of arguing about it. "B is 4% better on thumbs-up and 30% cheaper" ends the debate. So does "B is no better and slower" — which is just as valuable, because it stops you from shipping a change that felt like progress.
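
Before "B is 4% better" ends the debate, it helps to check that the gap is bigger than noise, especially when B only sees a small slice of traffic. One common way to do that is a two-proportion z-test; the sketch below is a rough sanity check under that assumption, not a feature of Currai. The ArmFeedback shape and the counts are made up for illustration.

```typescript
// Hypothetical per-arm feedback counts; the numbers below are invented.
interface ArmFeedback {
  thumbsUp: number; // requests that received a thumbs-up
  total: number; // requests that received any feedback
}

// z-score for the difference in thumbs-up rate (B minus A),
// using the pooled standard error of a two-proportion z-test.
function twoProportionZ(a: ArmFeedback, b: ArmFeedback): number {
  const pA = a.thumbsUp / a.total;
  const pB = b.thumbsUp / b.total;
  const pooled = (a.thumbsUp + b.thumbsUp) / (a.total + b.total);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / a.total + 1 / b.total));
  return se === 0 ? 0 : (pB - pA) / se;
}

const armA: ArmFeedback = { thumbsUp: 4000, total: 5000 }; // 80% thumbs-up
const armB: ArmFeedback = { thumbsUp: 420, total: 500 }; // 84% thumbs-up

const z = twoProportionZ(armA, armB);
// |z| > 1.96 corresponds to roughly 95% confidence (two-sided).
console.log(
  z.toFixed(2),
  Math.abs(z) > 1.96 ? "difference is likely real" : "could still be noise",
);
```

Treat the threshold as a rule of thumb. Checking the number repeatedly while the test is running inflates false positives, so it is safer to decide on a sample size or duration up front.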

You make decisions from production, not vibes. The split runs against the exact traffic you actually serve, including the weird inputs your eval set never imagined.

How it works in Currai

In Currai a prompt is a versioned object you fetch at runtime. You set up an experiment in the dashboard — pick the versions, give each a label and a weight, hit activate — and your code doesn't change at all. getPrompt resolves the active experiment for you with a weighted pick:

const prompt = await currai.getPrompt("bmi-intake");

// With an active experiment, resolution is a weighted pick across its variants.
// Otherwise it falls back to the `production` label, then the latest version.
prompt.selectedVariant; // { label: "concise", weight: 1 } | null

The one thing you do add is a link from the served version to the trace, so the split shows up in your data. Pass promptName and promptVersion onto the generation:

const gen = trace.generation({
  name: "openai.chat.completions",
  model: "gpt-4o-mini",
  input: prompt.compile({ weight: "70kg", height: "180cm" }),
  promptName: prompt.name,
  promptVersion: prompt.version,
});

Now every trace carries the version that produced it. Group your cost and latency rollups by promptVersion, watch the two arms diverge, and when one wins, promote it by moving the production label onto that version — no deploy, and an instant rollback if you change your mind.
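
To make "group by promptVersion" concrete, here is a small sketch of that rollup over exported trace rows. The TraceRow shape, including the costUsd and latencyMs field names and the numeric version, is an assumption for illustration; adapt it to however your trace data is actually shaped.

```typescript
// Hypothetical exported trace row. Only promptVersion comes from the
// snippet above; the other field names are assumptions.
interface TraceRow {
  promptVersion: number;
  costUsd: number;
  latencyMs: number;
}

interface Rollup {
  requests: number;
  avgCostUsd: number;
  avgLatencyMs: number;
}

// Average cost and latency per prompt version.
function rollupByVersion(rows: TraceRow[]): Map<number, Rollup> {
  const sums = new Map<number, { n: number; cost: number; latency: number }>();
  for (const row of rows) {
    const s = sums.get(row.promptVersion) ?? { n: 0, cost: 0, latency: 0 };
    s.n += 1;
    s.cost += row.costUsd;
    s.latency += row.latencyMs;
    sums.set(row.promptVersion, s);
  }
  const out = new Map<number, Rollup>();
  sums.forEach((s, version) => {
    out.set(version, {
      requests: s.n,
      avgCostUsd: s.cost / s.n,
      avgLatencyMs: s.latency / s.n,
    });
  });
  return out;
}

// Invented sample data: version 3 is arm A, version 4 is arm B.
const rows: TraceRow[] = [
  { promptVersion: 3, costUsd: 0.004, latencyMs: 800 },
  { promptVersion: 3, costUsd: 0.006, latencyMs: 1000 },
  { promptVersion: 4, costUsd: 0.003, latencyMs: 700 },
];

const rollups = rollupByVersion(rows);
console.log(rollups.get(3), rollups.get(4));
```

Averages are the simplest place to start. For latency in particular, a percentile such as p95 per version often tells you more than the mean.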

Not on the TypeScript SDK? The same resolution is available over REST at GET /api/public/prompts?name=…, so any language can participate in the split.

Start small

You don't need a stats platform to get value here. Pick one prompt that matters, make the change you've been meaning to make, and run it at 90/10 for a few days with the version linked into your traces. The first time an "obviously better" rewrite turns out to be a wash, or quietly worse, you'll stop shipping prompt changes blind.

Ready to set one up? See Prompts & A/B testing for the full walkthrough.