Published February 20, 2026 | Version v1
Preprint | Open Access

Proper Noun Failure: An Empirical Update on Evans' Law

  • PatternPulseAI

Description

LLMs struggle with proper nouns at the architectural level.

Evans’ Law (L ≈ 1969.8 × M^0.74) predicts coherence degradation in large language models relative to functional context capacity rather than advertised context window size. This paper reports the outcome of controlled testing initiated to support a formal withdrawal of the law’s formula on grounds of architectural divergence. Testing produced the opposite result: convergence. Six frontier models (Anthropic’s Sonnet 4.6, Opus 4.6, and Haiku 4.5, plus GPT-5.2, Grok, and Gemini) were observed in a simple multimodal proper noun production and verification task at baseline context. No model achieved a complete pass. The two embedded errors were each caught by exactly one model, and the two catching models were different. The field has not diverged; it has converged on a shared verification floor. This paper formally withdraws the pending withdrawal, corrects the law’s scope to include task-type risk independent of context load, and documents a severity-level finding: GPT-5.2 produced three distinct first-turn proper noun failures within 72 hours at sub-10,000-token context, each exhibiting confident confabulation rather than uncertainty acknowledgment. Gemini exhibited a separate failure category: complete processing failure on proper-noun-containing images, occurring twice in two days. This is a diagnostic finding, not a large-N study, but it appears to demonstrate that the formula requires updated constants; the law is strengthened.
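The power-law form quoted above can be sketched numerically. Note this is purely illustrative: the page does not define the units or meaning of L and M, so the function below treats them as abstract quantities, and the variable names are placeholders rather than the paper's notation.

```python
def evans_law(m: float, a: float = 1969.8, k: float = 0.74) -> float:
    """Evaluate the power-law form L = a * M**k quoted on this page.

    The interpretation of L and M is not specified here; this only
    reproduces the arithmetic of the stated formula.
    """
    return a * m ** k


# Because the exponent k is below 1, the law scales sublinearly:
# doubling M multiplies L by a fixed factor of 2**0.74 (about 1.67),
# independent of the starting value of M.
ratio = evans_law(2.0) / evans_law(1.0)
```

A sublinear exponent like 0.74 is consistent with the abstract's claim that only the constants, not the functional form, need updating: refitting a and k leaves the power-law shape intact.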

Files

evans-law-2026update-final.pdf

Size: 621.0 kB
md5: 33da54623f4ea43c857c44f1f849b8cc

Additional details

Related works

Is supplement to
Publication: 10.5281/zenodo.17593410 (DOI)
10.5281/zenodo.17999522 (DOI)

Dates

Available
2026-02-21
This paper presents empirical evidence that the AI field has not diverged architecturally but has converged on a shared verification floor, including proper noun verification failure across all six frontier models tested.