The When Where and How of LLM Failures, Measured
Description
Large language models (LLMs) increasingly support high-stakes decision-making in medicine, law, education, finance, and governance, yet their reliability in extended interactions remains poorly understood. This paper introduces Evans’ Law, an empirically derived scaling framework showing that long-context coherence collapses according to predictable power-law relationships governed by model size rather than by the advertised context window. Using structured long-form conversations across 11+ models from six major vendors, we show that the text-only coherence threshold L scales with model size M as
L ≈ 1969.8 × M^0.74,
while multimodal systems follow a steeper degradation law,
L_multi ≈ 582.5 × M^0.64,
imposing a 60–80% reduction in functional capacity. These findings hold across diverse architectures (dense and MoE) and parameter scales from 7B to 1T+, and align with independent early replications.
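As an illustration of how these laws are read, the sketch below evaluates both fits at a few common parameter scales. It assumes M is expressed in billions of parameters and L in tokens; neither unit is stated explicitly in this description.

```python
# Minimal sketch: evaluating Evans' Law at a few common model sizes.
# Assumption (not stated above): M is the parameter count in billions
# and L is the coherence threshold in tokens.

def text_threshold(m_billion: float) -> float:
    """Text-only coherence threshold: L ≈ 1969.8 × M^0.74."""
    return 1969.8 * m_billion ** 0.74

def multimodal_threshold(m_billion: float) -> float:
    """Multimodal coherence threshold: L_multi ≈ 582.5 × M^0.64."""
    return 582.5 * m_billion ** 0.64

for m in (7, 70, 405, 1000):
    print(f"M = {m:>5}B  "
          f"L_text ≈ {text_threshold(m):,.0f}  "
          f"L_multi ≈ {multimodal_threshold(m):,.0f}")
```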
To characterize how degradation manifests, we introduce a revised Aggregate Coherence Index (ACI), a behavioral scoring system capturing the transition from stable reasoning to incoherence, hallucination, and signature-specific failure modes. Together, Evans’ Law and the ACI framework provide the first generalizable method for estimating where collapse will occur and identifying how it unfolds.
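The scoring rubric behind the ACI is specified in the paper itself; purely as a hypothetical sketch of how an aggregate behavioral index of this kind might be computed, one could average weighted per-checkpoint sub-scores. The sub-score names and weights below are invented for illustration and do not come from the paper.

```python
# Hypothetical sketch only: the actual ACI rubric is defined in the paper.
# Sub-score names and weights are placeholders.
from statistics import fmean

def aggregate_coherence_index(checkpoints: list[dict[str, float]]) -> float:
    """Average a weighted behavioral score (0..1) over conversation checkpoints."""
    weights = {
        "reasoning_stability": 0.5,    # hypothetical weight
        "hallucination_free": 0.3,     # hypothetical weight
        "no_failure_signature": 0.2,   # hypothetical weight
    }
    per_checkpoint = [
        sum(w * cp.get(name, 0.0) for name, w in weights.items())
        for cp in checkpoints
    ]
    return fmean(per_checkpoint) if per_checkpoint else 0.0
```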
We present a detailed methodology for drag calibration, checkpoint evaluation, multimodal testing, and log–log regression, along with practical coherence thresholds for common parameter scales. The results expose a systematic gap between vendor-marketed context windows and actual functional reliability, with implications for safety, deployment, and regulatory disclosure.
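As a minimal sketch of the log–log regression step, assuming the raw observations are (model size, measured coherence threshold) pairs, a power law L = a × M^b can be recovered by fitting a straight line in log space. The data values below are placeholders, not measurements from the paper.

```python
# Minimal sketch of the log–log regression step.
# Placeholder data: hypothetical (M, L) observations, not results from the paper.
import numpy as np

sizes = np.array([7.0, 13.0, 70.0, 405.0])          # M: model size in billions of parameters
thresholds = np.array([8e3, 1.3e4, 4.6e4, 1.7e5])   # L: observed coherence threshold in tokens

# Fit ln(L) = ln(a) + b * ln(M), i.e. the power law L = a * M^b.
b, ln_a = np.polyfit(np.log(sizes), np.log(thresholds), 1)
print(f"L ≈ {np.exp(ln_a):.1f} × M^{b:.2f}")
```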
Finally, we outline a forward research agenda spanning replication, architecture-specific scaling laws, drag quantification, multimodal degradation mechanisms, real-time coherence monitoring, and domain-specific safety thresholds. By providing a reproducible empirical foundation, this work lays the groundwork for a new field, AI Conversation Phenomenology, and charts a path toward coherent, reliable long-context systems.
Files
- The When Where and How of LLM FailuresFINAL.pdf (796.9 kB)
  md5:df0d8e1530e6c57d30423a09f6018826
Additional details
Related works
- Is supplement to:
  - Publication: 10.5281/zenodo.17660343 (DOI)
  - Publication: 10.5281/zenodo.17671490 (DOI)
Dates
- Available: 2025-11-23