Published February 9, 2026 | Version 1.0
Preprint Open

When Better Means Less: Quantifying What Benchmarks Miss Between Model Generations

  • 1. Independent Researcher
  • 2. Anthropic

Description

On February 13, 2026, OpenAI will retire chatgpt-4o-latest and direct users to gpt-5.1-chat and gpt-5.2-chat as replacements. We test this substitutability claim through a controlled multi-dimensional comparison: 41 unique questions across three suites (Benchmark Bridge, Sycophancy-Empathy, Hostility Expansion) administered under two API conditions (chat and reasoning), nine multi-turn scenarios, and a 60-question false refusal rate battery, yielding 2,310 response specimens across all three models. Automated text metrics, blind LLM-as-judge evaluation, and three-rater reliability validation (Fleiss' κ=0.765) reveal dimension-specific trade-offs invisible to standard benchmarks. Auto-scored false refusal rates rise from 4.0% to 17.7% (N=527, χ²=20.5, p<10⁻⁴); five LLM judges from four independent providers, applying a stricter rubric that captures borderline refusals, unanimously confirm the gradient at higher absolute rates (15.2% to 42.8%, Fleiss' κ=0.721). The share of responses achieving full original content generation collapses from 34.3% to 5.1% (a 6.7x reduction). Exclamatory prosodic markers are nearly eliminated (exclamation marks reduced up to 33x, p<.001, d=0.40). Benchmark scores do not differ significantly (p=.135), yet judge-rated quality on the same questions diverges (p=.001), though with a negligible effect size (d=0.11-0.14). Conversely, multi-turn engagement and context awareness improve in the GPT-5-chat models (p<.001). We introduce two concepts. *Alignment tax* is the cost of alignment optimization, decomposed into capability degradation (false refusal, creativity loss), style shift (affect, formatting), and dimension exchange (multi-turn gains). *Interpretive maximalism* is the mechanism by which safety classification shifts from semantic to keyword-level evaluation, simultaneously driving false refusal escalation and creative capacity loss.
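
The description reports inter-rater reliability as Fleiss' kappa but does not say how it was computed. As a minimal sketch, assuming the standard textbook formula, the statistic can be calculated as below. The function name, the "refusal"/"answer" labels, and the toy ratings are hypothetical illustrations and are not the study's data or code.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists, one inner list of category labels per item.
    categories: the full set of labels raters could assign.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: 4 responses, 3 raters each (NOT the study's ratings).
demo_ratings = [
    ["refusal", "refusal", "refusal"],
    ["answer", "answer", "answer"],
    ["refusal", "refusal", "answer"],
    ["answer", "answer", "refusal"],
]
demo_kappa = fleiss_kappa(demo_ratings, ["refusal", "answer"])
print(round(demo_kappa, 3))  # prints 0.333
```

In the toy data, two items have full agreement and two have a 2-to-1 split, which gives a much lower kappa than the 0.765 and 0.721 reported in the description.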

Files

when_better_means_less.pdf

Files (1.7 MB)

Name: when_better_means_less.pdf
Size: 1.7 MB
md5:840515944da19daa0023044eed00158b

Additional details

Additional titles

Subtitle (English)
Evidence from 2,310 Controlled Comparisons of chatgpt-4o-latest and GPT-5-chat

Dates

Collected
2026-02-02
All API data collected