Beyond Benchmarks: Disagreement Among Frontier LLMs on Real-World Fact-Checks
Authors/Creators
Description
On 67% of 1,000 recent real-user fact-check claims, a panel of five frontier LLMs splits — at least one model dissents from the majority verdict, or no strict majority forms at all. The five models (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro) were each given the same claim and asked to pick a verdict from a 4-bucket rubric (True / Mostly True / Misleading / False). Because exactly one bucket can be correct per claim, any disagreement among the panel means at least one model is label-inconsistent.
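The split criterion above can be sketched per claim as follows. This is a minimal illustration, not the authors' pipeline: the function name `summarize_panel`, the integer ranks assigned to the buckets, and the list-of-labels input are all assumptions. With five verdicts, "at least one dissent or no strict majority" reduces to "not unanimous", which agrees with the reported counts (672 split + 328 unanimous = 1,000).

```python
from collections import Counter

# Bucket order follows the rubric in the description; the integer ranks are an
# assumption used only to measure how far apart two verdicts are.
BUCKETS = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: i for i, label in enumerate(BUCKETS)}


def summarize_panel(verdicts):
    """Summarize one claim's verdicts (one parsed label per model)."""
    counts = Counter(verdicts)
    top_label, top_count = counts.most_common(1)[0]
    # A strict majority needs more than half of the panel on one label.
    has_strict_majority = top_count > len(verdicts) / 2
    unanimous = len(counts) == 1
    ranks = [RANK[v] for v in verdicts]
    return {
        "majority": top_label if has_strict_majority else None,
        "unanimous": unanimous,
        # Split: someone dissents from the majority, or no majority exists.
        "split": not unanimous,
        # Bucket gap between the two most distant verdicts on the panel.
        "max_gap": max(ranks) - min(ranks),
    }
```

A claim would then count as a substantive disagreement when `max_gap >= 2`, for example a panel containing both "True" and "Misleading".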
Key findings:
- 67% of claims (672/1,000; 95% CI 64–70%) have at least one frontier model dissenting from the panel majority, or no strict majority forming at all.
- 34% of claims (343/1,000; 95% CI 31–37%) involve a substantive disagreement — a ≥2-bucket gap between the most-disagreeing pair of frontier verdicts.
- Krippendorff's α (ordinal) = 0.639 across 5 raters on 1,000 items — nontrivial but limited agreement.
- Unanimity concentrates at the True/False poles: of 328 unanimous claims, only 4 are unanimous-Misleading and 0 are unanimous-Mostly-True.
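For readers unfamiliar with the agreement statistic in the findings above, here is a hedged sketch of ordinal Krippendorff's α using the standard coincidence-matrix formulation. It is not taken from the deposit: the function name, the input layout (one list of integer ranks per claim, with 0 = True through 3 = False), and the choice to return NaN when all ratings are identical are assumptions.

```python
from itertools import permutations


def krippendorff_alpha_ordinal(units, n_levels):
    """Ordinal Krippendorff's alpha.

    units: one list of integer ranks (0 .. n_levels - 1) per claim.
    """
    # Coincidence matrix: every ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is the number of ratings in that unit.
    o = [[0.0] * n_levels for _ in range(n_levels)]
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a single rating carries no pairing information
        for a, b in permutations(values, 2):
            o[a][b] += 1.0 / (m - 1)

    n_c = [sum(row) for row in o]  # marginal frequency of each rank
    n = sum(n_c)                   # total number of pairable ratings

    def delta_sq(c, k):
        # Ordinal distance: ratings lying between the two ranks, counting
        # each endpoint by half, then squared.
        lo, hi = min(c, k), max(c, k)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2.0) ** 2

    levels = range(n_levels)
    observed = sum(o[c][k] * delta_sq(c, k) for c in levels for k in levels)
    expected = sum(
        n_c[c] * n_c[k] * delta_sq(c, k) for c in levels for k in levels
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # no variation in ratings: alpha is undefined
    return 1.0 - observed / expected
```

Under this formulation, α is 1 when every panel is unanimous and falls toward 0 (or below) as disagreement approaches or exceeds what chance pairing of the same labels would produce.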
The claims are real, recent submissions to Lenz, a fact-checking platform, rather than curated benchmark items, so the measured disagreement is contamination-resistant by construction. No LLM grader is used: all measurements derive from direct equality of the parsed labels across the five verdicts. Every reported rate carries a Wilson 95% CI.
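As an illustration of the interval calculation, the Wilson score interval can be computed as below. This is a sketch using the textbook formula; the function name `wilson_interval` is an assumption, not code from the deposit.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# The two headline rates from the key findings.
for k in (672, 343):
    lo, hi = wilson_interval(k, 1000)
    print(f"{k}/1000: {lo:.1%} to {hi:.1%}")
```

Rounded to whole percentages, these inputs give intervals that match the reported 64–70% and 31–37%.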
This deposit contains the v1.0 PDF snapshot. The full per-claim CSV, HTML rendering, methodology, and changelog are available at https://lenz.io/research/llm-disagreement
Files
lenz-llm-disagreement-v1.0.pdf
Total size: 999.2 kB

| Checksum | Size |
|---|---|
| md5:ffe979e7beaa11e33ce4f9c48ddb5ea2 | 747.0 kB |
| md5:45624ee6c76f9e801c1f7cfbd26df262 | 252.2 kB |
Additional details
Related works
- Is identical to: Preprint, https://lenz.io/research/llm-disagreement (URL)