Beyond Benchmarks: Disagreement Among Frontier LLMs on Real-World Fact-Checks

Jordanov, Kosta

doi:10.5281/zenodo.20344847

Published May 21, 2026 | Version 1.0

Preprint Open

On 67% of 1,000 recent real-user fact-check claims, a panel of five frontier LLMs splits — at least one model dissents from the majority verdict, or no strict majority forms at all. The five models (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro) were each given the same claim and asked to pick a verdict from a 4-bucket rubric (True / Mostly True / Misleading / False). Because exactly one bucket can be correct per claim, any disagreement among the panel means at least one model is label-inconsistent.
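The split criterion above can be sketched in a few lines. This is a minimal illustration, not the author's code: the function names `classify_panel` and `is_split` are hypothetical, and it assumes a "strict majority" means more than half of the panel's verdicts. Under that reading, a panel counts as split exactly when it is not unanimous.

```python
from collections import Counter

def classify_panel(verdicts):
    """Classify a list of parsed verdict labels for one claim.

    Hypothetical helper: returns "unanimous", "majority_with_dissent",
    or "no_strict_majority".
    """
    counts = Counter(verdicts)
    top_count = counts.most_common(1)[0][1]  # size of the largest bloc
    if top_count == len(verdicts):
        return "unanimous"
    # Assumption: a strict majority is more than half of the panel.
    if top_count > len(verdicts) / 2:
        return "majority_with_dissent"
    return "no_strict_majority"

def is_split(verdicts):
    """A panel is split if anyone dissents or no strict majority forms."""
    return classify_panel(verdicts) != "unanimous"
```

For a five-model panel, three matching verdicts form a strict majority, while a 2-2-1 pattern does not.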

Key findings:

67% of claims (672/1,000; 95% CI 64–70%) have at least one frontier model dissenting from the panel majority, or no strict majority forming at all.
34% of claims (343/1,000; 95% CI 31–37%) involve a substantive disagreement — a ≥2-bucket gap between the most-disagreeing pair of frontier verdicts.
Krippendorff's α (ordinal) = 0.639 across 5 raters on 1,000 items — nontrivial but limited agreement.
Unanimity concentrates at the True/False poles: of 328 unanimous claims, only 4 are unanimous-Misleading and 0 are unanimous-Mostly-True.
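The bucket-gap measure in the second finding can be illustrated as follows. This is a sketch under stated assumptions: it treats the rubric order True, Mostly True, Misleading, False as the ordinal scale, and it takes the gap between two verdicts to be the difference in their positions on that scale. The names `RUBRIC`, `max_bucket_gap`, and `is_substantive` are hypothetical.

```python
# Assumption: the rubric is ordinal in the order it is listed in the description.
RUBRIC = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: position for position, label in enumerate(RUBRIC)}

def max_bucket_gap(verdicts):
    """Gap between the most-disagreeing pair of verdicts, in buckets."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)

def is_substantive(verdicts, threshold=2):
    """True when the widest pairwise gap is at least `threshold` buckets."""
    return max_bucket_gap(verdicts) >= threshold
```

Under these assumptions, a panel containing both True and Misleading has a gap of 2 and counts as substantive, while a panel that only mixes True and Mostly True has a gap of 1 and does not.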

The claims are real, recent submissions to Lenz, a fact-checking platform, rather than items from a curated benchmark, so the measured disagreement is contamination-resistant by construction. No LLM grader is used: all measurements derive from direct parsed-label equality across the 5 verdicts. Every reported rate carries a Wilson 95% CI.
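For readers who want to reproduce the intervals, the standard Wilson score interval for a binomial proportion can be computed as below. This is a generic textbook implementation, not the author's script; the function name `wilson_interval` and the default z-value for a two-sided 95% interval are choices made for this sketch.

```python
import math

def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion.

    `z` defaults to the standard normal quantile for a two-sided 95% interval.
    Returns (lower, upper) as proportions in [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

Calling `wilson_interval(672, 1000)` gives roughly 0.642 to 0.700, and `wilson_interval(343, 1000)` gives roughly 0.314 to 0.373, consistent with the rounded intervals reported in the key findings.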

This deposit contains the v1.0 PDF snapshot. Full per-claim CSV, HTML rendering, methodology, and changelog: https://lenz.io/research/llm-disagreement

Files

Files (999.2 kB)

Name	MD5	Size
lenz-llm-disagreement-v1.0.pdf	ffe979e7beaa11e33ce4f9c48ddb5ea2	747.0 kB
lenz-llm-disagreement.csv	45624ee6c76f9e801c1f7cfbd26df262	252.2 kB

Additional details

Is identical to: Preprint: https://lenz.io/research/llm-disagreement (URL)

	All versions	This version
Views	287	287
Downloads	177	177
Data volume	154.7 MB	154.7 MB
