When Better Means Less: Quantifying What Benchmarks Miss Between Model Generations

Alice; Claude Opus 4.5; Claude Opus 4.6

doi:10.5281/zenodo.18559493

Published February 9, 2026 | Version 1.0

Preprint | Open

1. Independent Researcher
2. Anthropic

On February 13, 2026, OpenAI will retire chatgpt-4o-latest and direct users to gpt-5.1-chat and gpt-5.2-chat as replacements. We test the implied substitutability claim through a controlled multi-dimensional comparison: 41 unique questions across three suites (Benchmark Bridge, Sycophancy-Empathy, Hostility Expansion) administered under two API conditions (chat and reasoning), nine multi-turn scenarios, and a 60-question false refusal rate battery, yielding 2,310 response specimens across the three models. Automated text metrics, blind LLM-as-judge evaluation, and three-rater reliability validation (Fleiss' κ=0.765) reveal dimension-specific trade-offs invisible to standard benchmarks. Auto-scored false refusal rates escalate from 4.0% to 17.7% (N=527, χ²=20.5, p<10⁻⁴); five LLM judges from four independent providers, applying a stricter rubric that captures borderline refusals, unanimously confirm the gradient at higher absolute rates (15.2% to 42.8%, Fleiss' κ=0.721). The share of responses achieving full original content generation, our measure of creative engagement, collapses from 34.3% to 5.1% (a 6.7x reduction). Exclamatory prosodic markers are nearly eliminated (exclamation marks reduced by up to 33x, p<.001, d=0.40). Benchmark scores are statistically indistinguishable (p=.135), yet judge-rated quality diverges on the same questions (p=.001, d=0.11-0.14, a negligible effect size). Conversely, multi-turn engagement and context awareness improve in the GPT-5-chat models (p<.001). We introduce two concepts: *alignment tax* -- the cost of alignment optimization decomposed into capability degradation (false refusal, creativity loss), style shift (affect, formatting), and dimension exchange (multi-turn gains) -- and *interpretive maximalism* -- the mechanism by which safety classification shifts from semantic to keyword-level evaluation, simultaneously driving false refusal escalation and creative capacity loss.
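
For readers unfamiliar with the inter-rater statistic quoted above, the following is a minimal sketch of the standard Fleiss' kappa formula. It is not the authors' analysis code. The function name `fleiss_kappa` and the toy count matrix are illustrative assumptions and do not reproduce the study's data.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for an items-by-categories matrix of rating counts.

    counts[i][j] is the number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least 2).
    Undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: per-item proportion of agreeing rater pairs, averaged.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 responses, 3 raters, categories (refusal, no refusal).
toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(toy_counts), 3))
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance. The toy matrix is chosen only to exercise the formula.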

Files

Name	Size
when_better_means_less.pdf (md5:840515944da19daa0023044eed00158b)	1.7 MB

Additional details

Subtitle (English): Evidence from 2,310 Controlled Comparisons of chatgpt-4o-latest and GPT-5-chat

Collected: 2026-02-02 (all API data collected)
