Jun 19, 2026

AI hallucination evaluations: metrics and methods for 2026

Hallucination evals work when they score the right failure mode. Currai ties groundedness, factuality, consistency, and review back to the production traces that created the answer.

GUIDE · 8 min read · The Currai team / Engineering

TL;DR: hallucination evaluation is not one metric. An answer can be unsupported by retrieved context, wrong against the real world, inconsistent with the conversation, or unstable across repeated generations. The useful setup is to pick the metric that matches the failure, run it against real traced inputs and outputs, and turn bad cases into regressions.

Currai makes that loop practical because the evaluation data starts in the trace. The prompt, completion, retrieved chunks, tool outputs, prompt version, latency, tokens, user, and session already sit together. An eval can judge the actual answer users saw instead of a reconstructed example in a spreadsheet.

What hallucination means in an eval

In product conversations, "hallucination" often means "the answer was bad." That is too vague to evaluate. For measurement, split hallucinations into separate failure modes:

  • Groundedness: does the answer stay supported by retrieved context?
  • Faithfulness: does the output preserve the meaning of the source text?
  • Factuality: is the claim correct against a verified answer or real-world truth?
  • Conversation consistency: does the answer contradict earlier user or assistant turns?
  • Generation stability: do repeated runs for the same task agree, or is the model guessing?

These checks overlap, but they are not interchangeable. A RAG answer can be grounded in the retrieved chunks and still be factually wrong if retrieval found the wrong document. A summary can be factual in the abstract and still be unfaithful to the document it was asked to summarize. A support agent can answer correctly once and contradict itself two turns later.

The first decision in a hallucination eval is therefore not "which model should judge this?" It is "which failure would hurt the user?"

The core hallucination metrics

Groundedness is the default metric for RAG. It asks whether each important claim in the answer is supported by the documents, snippets, or tool results the system actually supplied to the model. It is strict by design: a true statement can still fail groundedness if the model was not given evidence for it.
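As a rough illustration of claim-level groundedness, the sketch below scores an answer as the fraction of its claims that the supplied chunks support. It assumes the answer has already been split into claims. The `overlapSupport` check is a naive token-overlap placeholder standing in for a judge model, and every name here is hypothetical rather than part of the Currai SDK.

```typescript
// Hypothetical types and helpers for illustration; not part of any SDK.
type SupportCheck = (claim: string, chunks: string[]) => boolean;

// Lowercase a string and split it into alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Placeholder support check: a claim counts as supported when at least
// `threshold` of its tokens appear in a single chunk. A judge model would
// normally replace this.
function overlapSupport(threshold: number): SupportCheck {
  return (claim, chunks) => {
    const claimTokens = tokenize(claim);
    if (claimTokens.length === 0) return false;
    return chunks.some((chunk) => {
      const chunkTokens = new Set(tokenize(chunk));
      const hits = claimTokens.filter((t) => chunkTokens.has(t)).length;
      return hits / claimTokens.length >= threshold;
    });
  };
}

// Groundedness = fraction of claims supported by the supplied context.
// An answer with no claims is treated as trivially grounded.
function groundedness(
  claims: string[],
  chunks: string[],
  isSupported: SupportCheck,
): { score: number; unsupported: string[] } {
  const unsupported = claims.filter((c) => !isSupported(c, chunks));
  const score = claims.length === 0 ? 1 : 1 - unsupported.length / claims.length;
  return { score, unsupported };
}

const groundednessResult = groundedness(
  ["Refunds are available within 30 days", "Refunds are paid in store credit"],
  ["Refunds are available within 30 days of purchase."],
  overlapSupport(0.8),
);
console.log(groundednessResult.score, groundednessResult.unsupported);
```

The second claim fails even if it happens to be true in the real world, because nothing in the supplied chunk supports it. That is the strictness described above.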

Faithfulness is the right check for summarization, extraction, rewriting, and report generation. The question is whether the output stayed inside the meaning of the source. This catches invented details, dropped qualifiers, and subtle rewrites that change the source's intent.

Factuality compares the output to a verified answer or source of truth. This is useful for open-domain QA, internal knowledge bases with approved answers, and workflows where the retrieval context is not enough to define correctness.

Consistency looks across the conversation or across repeated generations. In a multi-turn chat, consistency means the assistant does not contradict known facts from the session. In free-form generation, it means repeated runs of the same prompt should not produce semantically incompatible answers.

Format and policy adherence are not hallucination metrics on their own, but they often belong in the same eval run. If an answer invents JSON fields or ignores a "cite sources" requirement, the user experiences it as the same kind of trust failure.

The methods that produce the score

Once you know the metric, choose the scoring method.

LLM-as-judge is the practical default. A strong judge model reads the input, output, reference material, and rubric, then returns a score with a short reason. It is flexible enough for new products and subjective criteria, but it needs calibration. Use reviewed examples to make sure "3 out of 5" means the same thing to the judge that it means to the team.
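One way to check that calibration is to compare judge scores with human scores on the same reviewed traces. The sketch below assumes a 1 to 5 rubric and a hypothetical `ReviewedExample` shape. It reports exact agreement, agreement within one point, mean absolute error, and the trace ids worth re-reading.

```typescript
// Hypothetical shape: one judge score and one human score per reviewed trace.
interface ReviewedExample {
  traceId: string;
  judgeScore: number; // rubric score from the judge model
  humanScore: number; // rubric score from a human reviewer
}

interface CalibrationReport {
  exactAgreement: number; // share of examples with identical scores
  withinOne: number; // share of examples differing by at most one point
  meanAbsoluteError: number;
  disagreements: string[]; // trace ids that differ by more than one point
}

function calibrate(examples: ReviewedExample[]): CalibrationReport {
  if (examples.length === 0) {
    throw new Error("Need at least one reviewed example");
  }
  let exact = 0;
  let close = 0;
  let absError = 0;
  const disagreements: string[] = [];
  for (const ex of examples) {
    const diff = Math.abs(ex.judgeScore - ex.humanScore);
    if (diff === 0) exact += 1;
    if (diff <= 1) close += 1;
    else disagreements.push(ex.traceId);
    absError += diff;
  }
  const n = examples.length;
  return {
    exactAgreement: exact / n,
    withinOne: close / n,
    meanAbsoluteError: absError / n,
    disagreements,
  };
}

const calibrationReport = calibrate([
  { traceId: "t1", judgeScore: 3, humanScore: 3 },
  { traceId: "t2", judgeScore: 4, humanScore: 3 },
  { traceId: "t3", judgeScore: 5, humanScore: 2 },
  { traceId: "t4", judgeScore: 1, humanScore: 1 },
]);
console.log(calibrationReport);
```

The traces listed in `disagreements` are the ones to read when tightening the rubric.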

Rule-based checks work when the failure is deterministic: missing citations, invalid JSON, banned phrases, tool calls that violate policy, or answers that omit a required field. They are cheap, fast, and easy to trust, but they only catch what you can specify.
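A minimal sketch of such deterministic checks follows. The conventions are assumptions for illustration: the answer is a JSON object with an `answer` string, and citations look like `[1]`. Adjust both to whatever your output contract specifies.

```typescript
interface RuleResult {
  rule: string;
  passed: boolean;
}

// Runs deterministic checks on a raw model output.
// Assumed conventions (hypothetical): JSON object output, an `answer`
// string field, and bracketed numeric citations such as [1].
function checkStructuredAnswer(
  raw: string,
  requiredFields: string[],
  bannedPhrases: string[],
): RuleResult[] {
  const results: RuleResult[] = [];

  let parsed: unknown = null;
  let validJson = true;
  try {
    parsed = JSON.parse(raw);
  } catch {
    validJson = false;
  }
  results.push({ rule: "valid-json", passed: validJson });

  const obj: Record<string, unknown> =
    typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)
      ? (parsed as Record<string, unknown>)
      : {};
  for (const field of requiredFields) {
    results.push({ rule: `has-field:${field}`, passed: field in obj });
  }

  const answerField = obj["answer"];
  const text = typeof answerField === "string" ? answerField : "";
  results.push({ rule: "has-citation", passed: /\[\d+\]/.test(text) });

  const lower = text.toLowerCase();
  for (const phrase of bannedPhrases) {
    results.push({
      rule: `no-banned:${phrase}`,
      passed: !lower.includes(phrase.toLowerCase()),
    });
  }
  return results;
}

// Names of the rules that failed, for logging or attaching to a trace.
function failedRules(results: RuleResult[]): string[] {
  return results.filter((r) => !r.passed).map((r) => r.rule);
}

const goodOutput = '{"answer":"Reset it from Settings [1].","sources":["kb-12"]}';
console.log(failedRules(checkStructuredAnswer(goodOutput, ["answer", "sources"], ["as an AI"])));
console.log(failedRules(checkStructuredAnswer("Sure! Here is the answer", ["answer"], [])));
```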

Semantic similarity and embedding checks help when exact wording varies but meaning should match. They are useful for regression tests and retrieval quality, but they are usually not enough for high-risk factuality on their own.
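The comparison itself is usually cosine similarity between two embedding vectors. The sketch below assumes the embeddings were already produced by whatever embedding model you use. The toy vectors and the `0.85` threshold are illustrative assumptions, not recommended values.

```typescript
// Cosine similarity between two equal-length vectors.
// Returns 0 when either vector has zero magnitude.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Regression-style check: does the new answer stay close in meaning
// to a previously approved answer? The threshold is an assumption to tune.
function matchesReference(
  candidate: number[],
  reference: number[],
  threshold = 0.85,
): boolean {
  return cosineSimilarity(candidate, reference) >= threshold;
}

// Toy vectors standing in for real embeddings.
const approvedEmbedding = [0.9, 0.1, 0.0];
const newEmbedding = [0.8, 0.2, 0.1];
const unrelatedEmbedding = [0.0, 0.1, 0.9];
console.log(matchesReference(newEmbedding, approvedEmbedding));
console.log(matchesReference(unrelatedEmbedding, approvedEmbedding));
```

A high similarity only says the two texts are about the same thing. Two answers that differ in a single number or a negation can still score very close, which is why this check is a weak signal for factuality.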

Repeated sampling catches unstable answers when no gold reference exists. Run the same prompt several times, group answers by meaning, and flag high disagreement. The trade-off is cost: every sample is another model call.
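The grouping step can be sketched as follows, assuming the samples have already been collected. The equivalence check is pluggable: `normalizedMatch` is a deliberately crude placeholder that ignores case and punctuation, where a real setup would likely use a judge model or an embedding threshold. The `0.2` flag threshold is also an assumption.

```typescript
// Decides whether two answers mean the same thing. Hypothetical interface.
type Equivalent = (a: string, b: string) => boolean;

function normalize(s: string): string {
  return s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Placeholder equivalence: identical after stripping case and punctuation.
const normalizedMatch: Equivalent = (a, b) => normalize(a) === normalize(b);

// Greedy clustering: each sample joins the first cluster whose
// representative (first member) it is equivalent to.
function clusterByMeaning(samples: string[], same: Equivalent): string[][] {
  const clusters: string[][] = [];
  for (const sample of samples) {
    const home = clusters.find((c) => same(c[0] ?? "", sample));
    if (home) home.push(sample);
    else clusters.push([sample]);
  }
  return clusters;
}

// Disagreement = share of samples outside the largest cluster.
// 0 means every run agreed; values near 1 mean the model is guessing.
function disagreement(samples: string[], same: Equivalent): number {
  if (samples.length === 0) return 0;
  const clusters = clusterByMeaning(samples, same);
  const largest = Math.max(...clusters.map((c) => c.length));
  return 1 - largest / samples.length;
}

const stabilitySamples = ["30 days.", "30 days", "14 days", "30 Days!"];
const stabilityScore = disagreement(stabilitySamples, normalizedMatch);
const unstable = stabilityScore > 0.2; // illustrative threshold
console.log(stabilityScore, unstable);
```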

Human review remains the calibration layer. It does not scale to every request, but reviewed traces are how you validate judge prompts, build gold sets, and decide which failures matter enough to gate releases.

Why production traces are the best eval dataset

Static eval sets are useful, but they go stale. Users change how they ask questions, retrieval data changes, tools change, prompts change, and new failure modes appear after launch. Production traces keep the eval close to reality.

A Currai trace gives the scorer the full context of the answer:

  • The user input and conversation session.
  • The model output that was actually returned.
  • The prompt name and version that produced the answer.
  • Retrieval spans with documents or chunks.
  • Tool spans with inputs and outputs.
  • Token usage, latency, errors, tags, and metadata.

That context matters. If a groundedness score drops, you can inspect the trace and see whether retrieval missed the right document, the prompt ignored it, or the final generation invented a claim after a tool returned the right answer. Without the trace, the eval only tells you "bad." With the trace, it tells you where to look.

A Currai workflow for hallucination evals

Start by instrumenting the path you want to evaluate. For a RAG answer, the trace should contain the generation and the retrieval span:

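// Assumes `currai` is an initialized client and that `question`, `sessionId`,
// `userId`, `prompt`, `chunks`, and `answer` come from your application code.

// One trace per user request; the session and user ids group related turns.
const trace = currai.trace({
  name: "support-answer",
  sessionId,
  userId,
  input: { question },
  tags: ["rag", "support"],
});

// Retrieval span: records what was searched for...
const retrieval = trace.span({
  name: "retrieval.search",
  input: { question },
});

// ...and, once the retriever returns, which chunks came back.
retrieval.end({
  output: { chunks },
});

// Generation span: the exact model input plus the prompt name and version,
// so eval results can be compared across prompt versions.
const gen = trace.generation({
  name: "openai.chat.completions",
  model: "gpt-4o-mini",
  input: { question, chunks },
  promptName: "support-answer",
  promptVersion: prompt.version,
});

// Example token counts; pass the real usage from the model response.
gen.end({
  output: answer,
  usage: { input: 1400, output: 220, total: 1620, unit: "TOKENS" },
});

// Store the answer the user actually saw on the trace itself.
trace.update({ output: answer });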

Then run the eval against traces for that prompt or feature. Use groundedness for claims that should be supported by chunks, faithfulness for summarization, factuality when you have verified answers, and format checks for structured outputs. When a trace fails, keep it. That failed production example is now a regression case for the next prompt version.
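Turning failed traces into regression cases can be as simple as a filter over scored traces. The sketch below uses hypothetical `ScoredTrace` and `RegressionCase` shapes, not Currai's export format, and assumes each metric is a number where higher is better.

```typescript
// Hypothetical shape of a trace after evals have run.
interface ScoredTrace {
  traceId: string;
  promptVersion: number;
  input: string;
  output: string;
  scores: Record<string, number>; // metric name -> score, higher is better
}

// What gets saved for replay against the next prompt version.
interface RegressionCase {
  traceId: string;
  input: string;
  failingOutput: string;
  failedMetrics: string[];
  sourcePromptVersion: number;
}

// Keeps every trace that scored below the minimum on at least one metric.
// Metrics missing from a trace are skipped rather than treated as failures.
function collectRegressions(
  traces: ScoredTrace[],
  minimums: Record<string, number>,
): RegressionCase[] {
  const cases: RegressionCase[] = [];
  for (const trace of traces) {
    const failedMetrics = Object.entries(minimums)
      .filter(([metric, min]) => {
        const score = trace.scores[metric];
        return score !== undefined && score < min;
      })
      .map(([metric]) => metric);
    if (failedMetrics.length > 0) {
      cases.push({
        traceId: trace.traceId,
        input: trace.input,
        failingOutput: trace.output,
        failedMetrics,
        sourcePromptVersion: trace.promptVersion,
      });
    }
  }
  return cases;
}

const regressionCases = collectRegressions(
  [
    { traceId: "t1", promptVersion: 3, input: "q1", output: "a1", scores: { groundedness: 0.9, format: 1 } },
    { traceId: "t2", promptVersion: 3, input: "q2", output: "a2", scores: { groundedness: 0.4, format: 1 } },
    { traceId: "t3", promptVersion: 3, input: "q3", output: "a3", scores: { groundedness: 0.8, format: 0 } },
  ],
  { groundedness: 0.7, format: 1 },
);
console.log(regressionCases.map((c) => `${c.traceId}: ${c.failedMetrics.join(",")}`));
```

Storing the failing output alongside the input makes it possible to confirm later that a new prompt version changed the answer rather than just the score.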

Choosing the right eval by task

For RAG question answering, start with groundedness and retrieval relevance. The retrieved context is the reference. Add factuality when the retrieved data can be stale or incomplete.

For summarization, use faithfulness to the source document and a separate format check if the output has a required structure.

For tool-using agents, evaluate faithfulness to tool outputs, final-answer factuality, and whether the agent stayed within the tools it was allowed to call. Tool spans are important because a wrong final answer may come from a bad tool call rather than the final generation.

For free-form generation, use repeated sampling or rubric-based judging. No single answer key exists, so score the properties that matter: consistency, usefulness, tone, policy, or completeness.

For customer-facing support, combine groundedness, conversation consistency, and human review on low-scoring traces. Support answers are usually judged by trust, not just by whether one sentence was technically true.
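The task-to-eval pairings above can be written down as a small lookup so that each feature declares its eval plan explicitly. This is only a sketch: the task names, eval names, and option flags are hypothetical labels for the choices described in this section.

```typescript
type Task = "rag-qa" | "summarization" | "tool-agent" | "free-form" | "support";

type EvalName =
  | "groundedness"
  | "retrieval-relevance"
  | "factuality"
  | "faithfulness"
  | "format"
  | "tool-output-faithfulness"
  | "tool-policy"
  | "repeated-sampling"
  | "rubric-judge"
  | "conversation-consistency"
  | "human-review-on-low-scores";

// Starting point per task, mirroring the guidance in this section.
// For free-form generation the two entries are alternatives; keep either or both.
const baseEvals: Record<Task, EvalName[]> = {
  "rag-qa": ["groundedness", "retrieval-relevance"],
  summarization: ["faithfulness"],
  "tool-agent": ["tool-output-faithfulness", "factuality", "tool-policy"],
  "free-form": ["repeated-sampling", "rubric-judge"],
  support: ["groundedness", "conversation-consistency", "human-review-on-low-scores"],
};

interface PlanOptions {
  staleContextRisk?: boolean; // retrieved data can be stale or incomplete
  structuredOutput?: boolean; // output has a required structure
}

// Returns the eval plan for a task, adding the conditional checks.
function planEvals(task: Task, opts: PlanOptions = {}): EvalName[] {
  const plan = [...baseEvals[task]];
  if (task === "rag-qa" && opts.staleContextRisk) plan.push("factuality");
  if (task === "summarization" && opts.structuredOutput) plan.push("format");
  return plan;
}

console.log(planEvals("rag-qa", { staleContextRisk: true }));
console.log(planEvals("summarization", { structuredOutput: true }));
```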

FAQs: AI hallucination evaluations

Can I evaluate hallucinations without a gold dataset?

Yes. RAG systems can use retrieved context as the reference for groundedness. Summarization can use the source document. Free-form generation can use repeated sampling or rubric judging. A gold dataset is still useful for calibration and release gates, but it is not required for every eval.

Is LLM-as-judge reliable enough?

It can be reliable enough when the rubric is specific and the scores are checked against human-reviewed examples. Do not treat an uncalibrated judge as a release gate. Start by comparing judge results to reviewed traces, then tighten the rubric where the judge disagrees.

What should I do with a failed hallucination eval?

Open the trace first. Check retrieval, tool output, prompt version, model input, and final generation. Then save the failed case as a regression. The goal is not only to find one bad answer; it is to make sure the same class of answer does not ship again.
