Published May 21, 2026 | Version 1.0
Preprint Open

Beyond Benchmarks: Disagreement Among Frontier LLMs on Real-World Fact-Checks

Authors/Creators

Description

On 67% of 1,000 recent real-user fact-check claims, a panel of five frontier LLMs splits: at least one model dissents from the majority verdict, or no strict majority forms at all. The five models (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro) were each given the same claim and asked to pick one verdict from a four-bucket rubric (True / Mostly True / Misleading / False). Because exactly one bucket can be correct per claim, any disagreement within the panel means at least one model has assigned a label that cannot be correct.
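The split criterion can be illustrated with a minimal Python sketch. It is not the deposit's own code: the function names, the reading of "strict majority" as more than half of the panel, and the ordinal positions assigned to the four buckets (taken from the order in which the rubric lists them) are all assumptions made for illustration.

```python
from collections import Counter

# Assumed ordinal order of the rubric, as listed in the description.
BUCKETS = ["True", "Mostly True", "Misleading", "False"]


def classify_panel(verdicts):
    """Return 'unanimous', 'majority_with_dissent', or 'no_strict_majority'."""
    top_count = Counter(verdicts).most_common(1)[0][1]
    if top_count == len(verdicts):
        return "unanimous"
    # Assumption: a strict majority means more than half of the panel.
    if 2 * top_count > len(verdicts):
        return "majority_with_dissent"
    return "no_strict_majority"


def is_split(verdicts):
    """A panel splits whenever it is not unanimous."""
    return classify_panel(verdicts) != "unanimous"


def max_bucket_gap(verdicts):
    """Ordinal distance between the two most-disagreeing verdicts."""
    positions = [BUCKETS.index(v) for v in verdicts]
    return max(positions) - min(positions)
```

Under these assumptions, a panel returning True, True, True, Mostly True, True counts as split with a gap of 1, while a panel containing both True and Misleading has a gap of 2 and would fall under the substantive-disagreement figure reported in the key findings.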

Key findings:

  • 67% of claims (672/1,000; 95% CI 64–70%) have at least one frontier model dissenting from the panel majority, or no strict majority forming at all.
  • 34% of claims (343/1,000; 95% CI 31–37%) involve a substantive disagreement — a ≥2-bucket gap between the most-disagreeing pair of frontier verdicts.
  • Krippendorff's α (ordinal) = 0.639 across 5 raters on 1,000 items — nontrivial but limited agreement.
  • Unanimity concentrates at the True/False poles: of 328 unanimous claims, only 4 are unanimous-Misleading and 0 are unanimous-Mostly-True.
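
For readers who want to recompute the agreement statistic from the per-claim data, the following is a sketch of ordinal Krippendorff's α using the standard coincidence-matrix formulation. It is an independent illustration, not the implementation behind the reported 0.639; the function name and the encoding of verdicts as integers 0 to 3 are assumptions.

```python
from collections import Counter


def krippendorff_alpha_ordinal(units, n_levels):
    """Ordinal Krippendorff's alpha.

    units: list of rating lists, one per item; each rating is an int in
    range(n_levels). Items with fewer than two ratings are ignored.
    """
    # Coincidence matrix: an item with m ratings contributes each ordered
    # pair of its ratings with weight 1 / (m - 1).
    coincidence = [[0.0] * n_levels for _ in range(n_levels)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        counts = Counter(ratings)
        for c in range(n_levels):
            for k in range(n_levels):
                partner = counts[k] - 1 if c == k else counts[k]
                coincidence[c][k] += counts[c] * partner / (m - 1)

    totals = [sum(row) for row in coincidence]
    n = sum(totals)

    def delta_sq(c, k):
        # Ordinal distance: squared count of ratings lying between c and k,
        # with the two end categories counted at half weight.
        lo, hi = min(c, k), max(c, k)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2) ** 2

    levels = range(n_levels)
    observed = sum(coincidence[c][k] * delta_sq(c, k) for c in levels for k in levels)
    expected = sum(totals[c] * totals[k] * delta_sq(c, k) for c in levels for k in levels)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - (n - 1) * observed / expected
```

A call on the real data would pass 1,000 lists of five integers with `n_levels=4`. The ordinal metric penalizes a True/False clash more heavily than a True/Mostly True one, which is why it suits a graded rubric better than nominal α.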

The claims are real, recent submissions to Lenz, a fact-checking platform, rather than curated benchmark items, so the measured disagreement is contamination-resistant by construction. No LLM grader is used: every measurement derives from direct equality comparison of the parsed labels across the five verdicts. Every reported rate is accompanied by a Wilson 95% CI.
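
The Wilson intervals quoted for the two headline rates can be checked with a short calculation. The sketch below uses the standard Wilson score formula; the function name and the default critical value z ≈ 1.96 are illustrative choices, not details taken from the deposit.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Counts taken from the key findings above.
for label, count in [("any split", 672), ("gap of 2+ buckets", 343)]:
    low, high = wilson_interval(count, 1000)
    print(f"{label}: {count}/1000, 95% CI {low:.1%} to {high:.1%}")
```

Rounded to whole percentages, the two intervals come out at 64–70% and 31–37%, matching the figures reported in the key findings.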

This deposit contains the v1.0 PDF snapshot. Full per-claim CSV, HTML rendering, methodology, and changelog: https://lenz.io/research/llm-disagreement

Files

lenz-llm-disagreement-v1.0.pdf

Files (999.2 kB)

md5:ffe979e7beaa11e33ce4f9c48ddb5ea2 (747.0 kB)
md5:45624ee6c76f9e801c1f7cfbd26df262 (252.2 kB)

Additional details

Related works

Is identical to
Preprint: https://lenz.io/research/llm-disagreement (URL)