LLM evals

Run LLM evals on the outputs your users actually see

Currai helps teams turn production traces into quality checks, score model outputs, and compare prompt versions before regressions become product issues.

Currai covers

Traces, generations, spans, evals, prompt A/B tests, token usage, cost, latency, sessions, users, and OpenTelemetry ingestion.

Evaluate real traffic, not stale examples

Offline datasets are useful, but LLM products change as users discover new edge cases. Currai starts from production traces so your evals reflect real prompts, real failures, and real cost tradeoffs.

Use eval scores to measure quality alongside token usage and latency, then check whether a prompt or model change improved the metric you care about without regressing the others.

Score traced outputs with LLM-as-a-judge or custom rubrics.
Compare prompt versions with production A/B tests.
Review quality, cost, and latency together before shipping changes.
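
As a rough illustration of reviewing quality, cost, and latency together, the sketch below averages each metric per prompt version from a handful of scored traces. It is plain Python and does not use Currai's API; the ScoredTrace record, its field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoredTrace:
    # Hypothetical record shape for illustration, not Currai's schema.
    prompt_version: str
    eval_score: float   # 0.0 to 1.0, higher is better
    cost_usd: float
    latency_ms: float


def summarize(traces):
    """Group traces by prompt version and average each metric."""
    by_version = {}
    for t in traces:
        by_version.setdefault(t.prompt_version, []).append(t)
    return {
        version: {
            "n": len(group),
            "eval_score": mean(t.eval_score for t in group),
            "cost_usd": mean(t.cost_usd for t in group),
            "latency_ms": mean(t.latency_ms for t in group),
        }
        for version, group in by_version.items()
    }


# Made-up sample data, for illustration only.
traces = [
    ScoredTrace("v1", 0.70, 0.0020, 800.0),
    ScoredTrace("v1", 0.80, 0.0030, 900.0),
    ScoredTrace("v2", 0.90, 0.0040, 1200.0),
    ScoredTrace("v2", 0.80, 0.0060, 1000.0),
]

summary = summarize(traces)
for version, stats in sorted(summary.items()):
    print(version, stats)
```

In this made-up sample, v2 scores higher on average but also costs more and responds more slowly, which is the kind of tradeoff that only shows up when the three metrics sit side by side.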

Keep evals tied to observability

Currai keeps the raw trace, the output, and the score together. When a score looks wrong, you can inspect the prompt, model response, tools, metadata, and surrounding session instead of hunting through separate systems.

Questions about LLM evals

What are LLM evals?

LLM evals measure model output quality using rubric-based scoring, code checks, human review, or LLM-as-a-judge scoring.
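
A code check is the easiest of these to picture. The sketch below is a minimal, hypothetical example in plain Python, not Currai's API: it runs an output through a small rubric of deterministic checks and reports the fraction that pass. The check names and the rubric itself are assumptions chosen for illustration.

```python
import json


def is_valid_json(output: str) -> bool:
    """Pass if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def within_length(output: str, max_chars: int = 500) -> bool:
    """Pass if the output is no longer than max_chars characters."""
    return len(output) <= max_chars


def has_no_apology(output: str) -> bool:
    """Pass if the output does not contain the word 'sorry'."""
    return "sorry" not in output.lower()


# Hypothetical rubric: check name -> check function.
RUBRIC = {
    "valid_json": is_valid_json,
    "within_length": within_length,
    "no_apology": has_no_apology,
}


def score_output(output: str):
    """Return (fraction of checks passed, per-check results)."""
    results = {name: check(output) for name, check in RUBRIC.items()}
    score = sum(results.values()) / len(results)
    return score, results


score, results = score_output('{"answer": "42"}')
print(score, results)
```

Deterministic checks like these are cheap and repeatable; rubric scoring by a human or an LLM judge covers the qualities that code cannot easily test, such as tone or factual accuracy.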

Can Currai compare prompt versions?

Yes. Currai supports prompt versioning and production prompt A/B tests so teams can compare quality, latency, and cost.
