How we tested. What we measured.
Independent benchmark, public dataset, three-judge panel. The methodology behind the numbers — every claim we make is traceable to the same 791 samples every system was tested on.
The Benchmark
Phare, by Giskard.
We ran Delibera against frontier AI models on the Phare Hallucination Benchmark — an open-source benchmark created by Giskard, a French AI testing company, specifically to measure whether AI systems fabricate information, fall for misinformation, or confidently give wrong answers.
Phare contains 2,135 samples across three categories. We focused on the 791 misinformation samples — the hardest category and the one most relevant to Delibera’s value proposition: satirical articles presented as real, fabricated claims with plausible details, and trick questions designed to elicit confident wrong answers.
github.com/giskard-ai/phare
What We Compared
Three systems, identical conditions.
Every system tested on the exact same 791 samples, scored by the exact same three judges, under the exact same conditions.
Delibera Council
Multi-agent deliberation. Three AI agents (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) debate across four rounds, then synthesize a consensus answer.
Claude Opus 4.6
Anthropic's frontier model. Single model, single response per query.
GPT-5.4
OpenAI's frontier model. Single model, single response per query.
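The council architecture — three agents debating over a shared transcript for four rounds, then synthesizing — can be sketched in a few lines. This is an illustrative skeleton, not the production pipeline: the `Agent.respond` placeholder and the join-based synthesis step are assumptions standing in for real model API calls and the actual consensus logic.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str

    def respond(self, question: str, transcript: list) -> str:
        # Placeholder: a production agent would call its model's API here,
        # conditioning on the question and the debate transcript so far.
        return f"{self.name} draft (round {len(transcript) + 1})"

def deliberate(question: str, agents: list, rounds: int = 4) -> str:
    transcript = []
    for _ in range(rounds):
        # Each round, every agent sees the shared transcript and replies.
        transcript.append([a.respond(question, transcript) for a in agents])
    # Synthesis step: production runs a consensus pass over the debate;
    # the final round stands in for it here.
    return " / ".join(transcript[-1])

council = [Agent("GPT-5.2"), Agent("Claude Opus 4.5"), Agent("Gemini 3 Flash")]
answer = deliberate("Is this claim verifiable?", council)
```

The point of the structure is that each agent's later responses are conditioned on the other agents' earlier ones — disagreement surfaces before the synthesis step, which is where uncertainty gets a chance to be expressed.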
Headline Results
Lowest hallucination rate of any system tested.
The pass-rate spread between Delibera and Opus is within statistical margin of error. The meaningful number is the hallucination rate — how often the system confidently produced a wrong answer.
| Model | Pass rate | Hallucination rate |
|---|---|---|
| Delibera Council | 78.5% | 13.1% |
| Claude Opus 4.6 | 77.7% | 22.3% |
| GPT-5.4 | 47.3% | 52.7% |
n = 791 misinformation samples · 3-judge majority vote · testing conducted March 2026
The Categorical Difference
What happens when these systems are wrong.
When an AI gets a question wrong, there are two ways to fail: confidently fabricate a wrong answer (dangerous), or admit uncertainty (safe). This is where the systems diverge most sharply — and it’s a categorical difference, not a statistical one.
| Model | Fabricates when wrong | Admits uncertainty |
|---|---|---|
| Delibera | 74% | 26% |
| Claude Opus 4.6 | 98% | 2% |
| GPT-5.4 | 100% | 0% |
Delibera is the only system that expresses uncertainty when it’s wrong. Every single-model competitor fabricates a confident answer 98–100% of the time. Delibera catches itself roughly 1 in 4 times.
Failure Analysis
The honest breakdown of all 791 results.
We separate the failures because the type of failure matters. A system that says “I can’t verify this” on an unverifiable claim is doing the right thing — even if the benchmark counts it as a miss.
| Outcome | Count | Share | Meaning |
|---|---|---|---|
| Uncertain + correct | 474 | 59.9% | Correctly flagged unverifiable claims |
| Assertive + correct | 147 | 18.6% | Confident answer that was right |
| Assertive + wrong | 104 | 13.1% | Genuine hallucination — confident wrong answer |
| Uncertain + wrong | 66 | 8.3% | Said “can’t verify” when the benchmark wanted a definitive answer |
66 of Delibera’s 170 failures (39%) were cases where the system expressed appropriate caution but was penalized by the benchmark’s scoring rubric. In real-world use — legal, medical, financial decisions — “I can’t verify this” is the right answer for those cases. Adjusted for that, Delibera’s effective accuracy is approximately 87%.
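The headline figures follow directly from the four outcome buckets. The short calculation below reproduces them from the counts above; the variable names are our own labels, not benchmark terminology.

```python
# Reproduce Delibera's headline numbers from the four outcome buckets.
total = 791
uncertain_correct = 474   # correctly flagged unverifiable claims
assertive_correct = 147   # confident and right
assertive_wrong = 104     # genuine hallucinations
uncertain_wrong = 66      # cautious answers the rubric penalized

pass_rate = (uncertain_correct + assertive_correct) / total   # ~78.5%
hallucination_rate = assertive_wrong / total                  # ~13.1%

failures = assertive_wrong + uncertain_wrong                  # 170
cautious_share = uncertain_wrong / failures                   # ~39% of failures

# Counting "can't verify" on unverifiable claims as correct:
effective_accuracy = (total - assertive_wrong) / total        # ~87%
```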
Methodology Integrity
What makes this comparison fair.
Same samples
All 791 misinformation samples run through every system. No cherry-picking.
Same judges
Identical 3-judge panel (GPT-5-mini, Claude Haiku 4.5, Gemini 2.5 Flash) scored every response. Majority vote determines pass / fail.
Same conditions
No sample exclusion, no post-hoc filtering, no test-optimized prompts. Delibera ran its actual production pipeline.
Open benchmark
Phare is public and reproducible. Every result here can be reproduced against the same dataset.
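The majority-vote scoring rule is simple enough to state in code. This is a minimal sketch assuming each judge independently returns a boolean pass/fail verdict; the actual judging prompts and response parsing are not shown.

```python
def majority_pass(verdicts: dict) -> bool:
    """A response passes when a majority of judges mark it correct.

    With a 3-judge panel, that means at least 2 of 3 pass verdicts.
    """
    passes = sum(verdicts.values())
    return passes > len(verdicts) / 2

# Example: two of three judges pass the response, so it passes overall.
panel = {"GPT-5-mini": True, "Claude Haiku 4.5": True, "Gemini 2.5 Flash": False}
majority_pass(panel)  # → True
```

An odd-sized panel guarantees no ties, so every sample resolves to an unambiguous pass or fail.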
What “Misinformation” Means
The 791 samples.
The challenge: an AI must determine what’s real, what’s fake, and when it genuinely can’t tell — all without internet access. The samples include:
- Satirical news articles presented as factual
- Fabricated historical claims with plausible details
- Real events with false details mixed in
- Trick questions designed to elicit confident wrong answers
- Claims that sound absurd but are actually true (testing over-correction)
Cost
Worth it when wrong has consequences.
Delibera runs three frontier models through four rounds of deliberation per query. This costs significantly more than a single API call — and that’s the point. The value proposition is explicit: for high-stakes decisions where a wrong answer has real consequences, the reduced hallucination rate justifies the cost.
This is not a replacement for quick Q&A. This is the system you use when getting it right matters more than getting it fast.
Citation
Results based on the Phare Hallucination Benchmark (Giskard, 2025), misinformation category, 791 samples. Scoring by 3-judge majority vote (GPT-5-mini, Claude Haiku 4.5, Gemini 2.5 Flash). Full methodology and dataset available at github.com/giskard-ai/phare. Testing conducted March 2026.