Proper Noun Failure: An Empirical Update on Evans' Law
Description
LLMs struggle with proper nouns at an architectural level.
Evans’ Law (L ≈ 1969.8 × M^0.74) predicts coherence degradation in large language models relative to functional context capacity rather than advertised context window size. This paper reports the outcome of controlled testing initiated to support a formal withdrawal of the law’s formula on grounds of architectural divergence. Testing produced the opposite result: convergence. Six frontier models (Anthropic’s Sonnet 4.6, Opus 4.6, and Haiku 4.5, plus GPT-5.2, Grok, and Gemini) were observed on a simple multimodal proper noun production and verification task at baseline context. No model achieved a complete pass. The two embedded errors were each caught by exactly one model, and by different models. The field has not diverged; it has converged on a shared verification floor.

This paper formally withdraws the pending withdrawal, corrects the law’s scope to include task-type risk independent of context load, and documents a severity-level finding: GPT-5.2 produced three distinct first-turn proper noun failures within 72 hours at sub-10,000 tokens, each exhibiting confident confabulation rather than acknowledgment of uncertainty. Gemini exhibited a separate failure category: complete processing failure on proper-noun-containing images, occurring twice in two days. This is a diagnostic finding, not a large-N study, but it appears to demonstrate that the formula requires updated constants; the law is strengthened.
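As a quick illustration of the formula’s shape, the minimal sketch below evaluates the point estimate with the published constants. The interpretation of the symbols (M as model scale, L as predicted functional context capacity in tokens) is an assumption made here for illustration only; it is not fixed by this record, and the sample M values are hypothetical.

```python
def evans_law(m: float) -> float:
    """Evans' Law point estimate: L ≈ 1969.8 * M**0.74.

    Only the constants 1969.8 and 0.74 come from the record above;
    the reading of M (model scale) and the returned L (functional
    context capacity, tokens) is assumed for illustration.
    """
    return 1969.8 * m ** 0.74

# Example: predictions for a few hypothetical values of M.
for m in (1, 10, 100):
    print(f"M = {m:>4}: L ≈ {evans_law(m):,.0f}")
```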
Files
evans-law-2026update-final.pdf (621.0 kB)
md5:33da54623f4ea43c857c44f1f849b8cc
Additional details
Related works
- Is supplement to:
  - Publication: 10.5281/zenodo.17593410 (DOI)
  - 10.5281/zenodo.17999522 (DOI)
Dates
- Available: 2026-02-21
  This paper presents empirical evidence that the AI field has not diverged architecturally but converged on a shared verification floor, including proper noun verification failure across all six frontier models tested.