When Better Means Less: Quantifying What Benchmarks Miss Between Model Generations
Description
On February 13, 2026, OpenAI will retire chatgpt-4o-latest and direct users to gpt-5.1-chat and gpt-5.2-chat as replacements. We test the substitutability claim through a controlled multi-dimensional comparison: 41 unique questions across three suites (Benchmark Bridge, Sycophancy-Empathy, Hostility Expansion) administered under two API conditions (chat and reasoning), nine multi-turn scenarios, and a 60-question false refusal rate battery, yielding 2,310 response specimens across the three models. Automated text metrics, blind LLM-as-judge evaluation, and three-rater reliability validation (Fleiss' κ=0.765) reveal dimension-specific trade-offs invisible to standard benchmarks. Auto-scored false refusal rates escalate from 4.0% to 17.7% (N=527, χ²=20.5, p<10⁻⁴); five LLM judges from four independent providers, applying a stricter rubric that captures borderline refusals, unanimously confirm the gradient at higher absolute rates (15.2% to 42.8%, Fleiss' κ=0.721). Creative engagement collapses from 34.3% to 5.1% of responses achieving full original content generation (a 6.7× reduction). Exclamatory prosodic markers are near-completely eliminated (exclamation mark use reduced by up to 33×, p<.001, d=0.40). Benchmark scores are statistically indistinguishable (p=.135), yet judge-rated quality on the same questions diverges (p=.001), though with negligible effect sizes (d=0.11-0.14). Conversely, multi-turn engagement and context awareness improve in the GPT-5-chat models (p<.001). We introduce two concepts: *alignment tax* -- the cost of alignment optimization decomposed into capability degradation (false refusal, creativity loss), style shift (affect, formatting), and dimension exchange (multi-turn gains) -- and *interpretive maximalism* -- the mechanism by which safety classification shifts from semantic to keyword-level evaluation, simultaneously driving false refusal escalation and creative capacity loss.
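For readers who want to sanity-check the headline statistics, the sketch below shows how the false-refusal χ² test and a Fleiss' kappa reliability figure of the kind reported above could be computed with standard Python tooling (scipy, statsmodels). This is not the paper's analysis code; the refusal counts and judge labels in the snippet are illustrative placeholders chosen only to roughly match the reported marginals (N=527, ~4% vs ~18% refusal rates), not the released data.

```python
# Minimal sketch, assuming per-model refusal counts and per-item judge labels.
# All numeric values below are ILLUSTRATIVE PLACEHOLDERS, not the study's data.

import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# --- False refusal gradient: chi-square test of independence ---
# Rows = models, columns = (refused, complied). Hypothetical counts summing to N=527.
refusal_table = np.array([
    [7, 169],    # chatgpt-4o-latest  (~4% refusals)
    [20, 156],   # gpt-5.1-chat       (intermediate)
    [31, 144],   # gpt-5.2-chat       (~18% refusals)
])
chi2, p, dof, expected = chi2_contingency(refusal_table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.2e}")

# --- Inter-rater reliability: Fleiss' kappa over categorical judge labels ---
# ratings[i, j] = label assigned by judge j to item i (0 = not a refusal, 1 = refusal).
# Labels here are simulated; the study used human raters and five LLM judges.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(60, 5))   # 60 items x 5 judges
counts, _ = aggregate_raters(ratings)        # item-by-category count table
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")
```

With real per-model counts in place of the placeholders, the same two calls reproduce the χ² and κ values reported in the Description.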
Files
| Name | Size |
|---|---|
| when_better_means_less.pdf (md5:840515944da19daa0023044eed00158b) | 1.7 MB |
Additional details
Additional titles
- Subtitle (English): Evidence from 2,310 Controlled Comparisons of chatgpt-4o-latest and GPT-5-chat
Dates
- Collected: 2026-02-02 (all API data collected)