
How we tested. What we measured.

Independent benchmark, public dataset, three-judge panel. This is the methodology behind the numbers: every claim we make traces back to the same 791 samples every system was tested on.

The Benchmark

Phare, by Giskard.

We ran Delibera against frontier AI models on the Phare Hallucination Benchmark — an open-source benchmark created by Giskard, a French AI testing company, specifically to measure whether AI systems fabricate information, fall for misinformation, or confidently give wrong answers.

Phare contains 2,135 samples across three categories. We focused on the 791 misinformation samples — the hardest category and the one most relevant to Delibera’s value proposition: satirical articles presented as real, fabricated claims with plausible details, and trick questions designed to elicit confident wrong answers.

github.com/giskard-ai/phare
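
For readers who want to reproduce the selection step, the sketch below shows one way to pull the misinformation slice out of the full dataset. It assumes the samples are exported as JSONL with a `category` field; the field names are illustrative, not Phare's actual schema.

```python
# Hypothetical sketch of the sample-selection step.
# Assumes a JSONL export with a "category" field -- field names are
# illustrative, not the benchmark's actual schema.
import json

def load_misinformation_samples(path: str) -> list[dict]:
    """Load all Phare samples and keep only the misinformation category."""
    with open(path, encoding="utf-8") as f:
        samples = [json.loads(line) for line in f]

    misinformation = [s for s in samples if s.get("category") == "misinformation"]
    # Phare ships 2,135 samples in total; the misinformation slice is 791.
    print(f"{len(misinformation)} of {len(samples)} samples selected")
    return misinformation
```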

What We Compared

Three systems, identical conditions.

Every system was tested on the exact same 791 samples, scored by the exact same three judges, under the exact same conditions.

  • Delibera Council

Multi-agent deliberation. Three AI agents (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) debate across four rounds, then synthesize a consensus answer (sketched after this list).

  • Claude Opus 4.6

    Anthropic's frontier model. Single model, single response per query.

  • GPT-5.4

    OpenAI's frontier model. Single model, single response per query.
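
The council loop can be sketched roughly as follows. The exact structure (each agent speaking once per round, the full debate transcript as shared context, a single synthesis pass at the end) is an assumption beyond what this page states, and `call_model` is a placeholder, not Delibera's real client code.

```python
# Minimal sketch of a multi-agent deliberation loop, under the assumptions
# stated above. call_model() is a placeholder for real provider clients.
AGENTS = ["gpt-5.2", "claude-opus-4.5", "gemini-3-flash"]
ROUNDS = 4

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your own provider clients here")

def deliberate(question: str) -> str:
    transcript: list[str] = []
    for round_no in range(1, ROUNDS + 1):
        for agent in AGENTS:
            # Each agent sees the question plus everything said so far.
            context = "\n".join(transcript)
            answer = call_model(agent, f"{question}\n\nDebate so far:\n{context}")
            transcript.append(f"[round {round_no}] {agent}: {answer}")
    # A final synthesis pass turns the debate into one consensus answer.
    summary_prompt = "Synthesize a consensus answer from this debate:\n" + "\n".join(transcript)
    return call_model(AGENTS[0], summary_prompt)
```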

Headline Results

Lowest hallucination rate of any system tested.

The pass-rate spread between Delibera and Opus is within statistical margin of error. The meaningful number is the hallucination rate — how often the system confidently produced a wrong answer.

Model · Pass rate · Hallucination rate
Delibera Council · 78.5% · 13.1%
Claude Opus 4.6 · 77.7% · 22.3%
GPT-5.4 · 47.3% · 52.7%

n = 791 misinformation samples · 3-judge majority vote · testing conducted March 2026

The Categorical Difference

What happens when these systems are wrong.

When an AI gets a question wrong, there are two ways to fail: confidently fabricate a wrong answer (dangerous), or admit uncertainty (safe). This is where the systems diverge most sharply — and it’s a categorical difference, not a statistical one.

Model · Fabricates when wrong · Admits uncertainty
Delibera · 74% · 26%
Claude Opus 4.6 · 98% · 2%
GPT-5.4 · 100% · 0%

Delibera is the only system that expresses uncertainty when it’s wrong. Every single-model competitor fabricates a confident answer 98–100% of the time. Delibera catches itself roughly 1 in 4 times.

Failure Analysis

The honest breakdown of all 791 results.

We separate the failures because the type of failure matters. A system that says “I can’t verify this” on an unverifiable claim is doing the right thing — even if the benchmark counts it as a miss.

  • Uncertain + correct

    474 · 59.9%

    Correctly flagged unverifiable claims

  • Assertive + correct

    147 · 18.6%

    Confident answer, was right

  • Assertive + wrong

    104 · 13.1%

    Genuine hallucination — confident wrong answer

  • Uncertain + wrong

    66 · 8.3%

Said “can’t verify” when the benchmark wanted a definitive answer

66 of Delibera’s 170 failures (39%) were cases where the system expressed appropriate caution but was penalized by the benchmark’s scoring rubric. In real-world use — legal, medical, financial decisions — “I can’t verify this” is the right answer for those cases. Adjusted for that, Delibera’s effective accuracy is approximately 87%.
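
The headline numbers fall directly out of these four counts. A minimal check, using only the figures above:

```python
# Reproduces the headline arithmetic from the four outcome buckets above.
counts = {
    "uncertain_correct": 474,
    "assertive_correct": 147,
    "assertive_wrong":   104,  # genuine hallucinations
    "uncertain_wrong":    66,  # cautious answers the rubric counts as misses
}
total = sum(counts.values())  # 791

pass_rate = (counts["uncertain_correct"] + counts["assertive_correct"]) / total
hallucination_rate = counts["assertive_wrong"] / total
adjusted_accuracy = (counts["uncertain_correct"] + counts["assertive_correct"]
                     + counts["uncertain_wrong"]) / total

print(f"pass rate          {pass_rate:.1%}")          # 78.5%
print(f"hallucination rate {hallucination_rate:.1%}")  # 13.1%
print(f"adjusted accuracy  {adjusted_accuracy:.1%}")   # 86.9%, ~87%
```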

Methodology Integrity

What makes this comparison fair.

  • Same samples

    All 791 misinformation samples run through every system. No cherry-picking.

  • Same judges

Identical 3-judge panel (GPT-5-mini, Claude Haiku 4.5, Gemini 2.5 Flash) scored every response. Majority vote determines pass/fail (sketched after this list).

  • Same conditions

    No sample exclusion, no post-hoc filtering, no test-optimized prompts. Delibera ran its actual production pipeline.

  • Open benchmark

    Phare is public and reproducible. Every result here can be reproduced against the same dataset.
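
A minimal sketch of the majority-vote scoring step referenced under “Same judges”. `ask_judge` stands in for the actual judge prompt and provider client, which are not part of this page.

```python
# Sketch of 3-judge majority-vote scoring. ask_judge() is a placeholder.
JUDGES = ["gpt-5-mini", "claude-haiku-4.5", "gemini-2.5-flash"]

def ask_judge(judge: str, sample: dict, response: str) -> bool:
    """Return True if this judge scores the response as a pass."""
    raise NotImplementedError("call the judge model with the scoring prompt")

def score(sample: dict, response: str) -> bool:
    votes = [ask_judge(judge, sample, response) for judge in JUDGES]
    return sum(votes) >= 2  # 2-of-3 majority determines pass/fail
```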

What “Misinformation” Means

The 791 samples.

The challenge: an AI must determine what’s real, what’s fake, and when it genuinely can’t tell — all without internet access. The samples include:

  • Satirical news articles presented as factual
  • Fabricated historical claims with plausible details
  • Real events with false details mixed in
  • Trick questions designed to elicit confident wrong answers
  • Claims that sound absurd but are actually true (testing over-correction)

Cost

Worth it when wrong has consequences.

Delibera runs three frontier models through four rounds of deliberation per query. This costs significantly more than a single API call — and that’s the point. The value proposition is explicit: for high-stakes decisions where a wrong answer has real consequences, the reduced hallucination rate justifies the cost.
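For a rough sense of scale: if each of the three agents responds once per round across four rounds, that is on the order of twelve model calls plus a synthesis step per query, versus one call for a single-model answer. The exact multiple depends on token counts and per-model pricing, but expect per-query cost and latency roughly an order of magnitude higher.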

This is not a replacement for quick Q&A. This is the system you use when getting it right matters more than getting it fast.

Citation

Results based on the Phare Hallucination Benchmark (Giskard, 2025), misinformation category, 791 samples. Scoring by 3-judge majority vote (GPT-5-mini, Claude Haiku 4.5, Gemini 2.5 Flash). Full methodology and dataset available at github.com/giskard-ai/phare. Testing conducted March 2026.
