Published February 11, 2026 | Version v1
Preprint | Open Access

Hospitals, Helpdesks, Codebases, and Beyond: One LLM Collapse Law Across Domains

Description

Reliable deployment of large language models (LLMs) remains limited not primarily by average capability, but by tail failures that surface when real systems encounter subgroup rarity, venue shift, long context, and multi-step execution. These failures often look inconsistent or “mysterious” because standard evaluation compresses heterogeneous regimes into a single denominator, allowing global scores to remain high while critical cohorts degrade, rankings invert, and brittle shortcuts survive.

We frame reliability as a field property rather than a model trait by operating at the level of Large Language Fields (LLFs): the layer where tasks, venues, constraints, budgets, admissibility rules, and evidence flows determine what can safely and stably be done in practice. Within this framing, we formalize a compact collapse law for reliability in the wild, the LLF Collapse Identity (ICC): R(x) + C(x) = 1, where R(x) denotes reliability that survives explicit subfield constraints under context x, and C(x) denotes collapse share, the fraction of apparent performance that does not survive tail conditions.

We operationalize ICC through a practical metric stack designed for promotion governance: the Field Collapse Index (FCI), which measures the gap between global and tail performance under an explicit Subfield Stack; the Inversion Rate (IR), which detects ranking reversals on critical subfields even when global metrics improve; and the Syntactic Reliance Score (SRS), which exposes shortcut behavior driven by interface artifacts, templates, labels, or tool/prompt syntax rather than by stable semantics.

To make promotion decisions replayable and contestable, we specify a minimal evaluation receipt schema that records field definitions, subfield rules, metrics, uncertainty bounds, derived collapse measures, and gate outcomes.

Across a cross-domain suite of 24 scenarios spanning hospitals, helpdesks, codebases, and additional operational settings, we show that collapse share consistently tracks where systems break, predicts operational surprise better than global scores alone, and provides a simple, auditable basis for blocking, holding, or promoting models based on bounded subfield behavior.
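To make the metric stack concrete, here is a minimal Python sketch. The exact estimators are specified in the paper; the functional forms below (collapse share as 1 − R(x), FCI as the gap between the global score and the weakest subfield, IR as the fraction of critical subfields on which a candidate regresses, SRS as the score lost under a semantics-preserving syntax perturbation) are illustrative assumptions, not the paper's definitions.

```python
from typing import Dict, List

def collapse_share(reliability: float) -> float:
    """ICC identity R(x) + C(x) = 1, so collapse share is C(x) = 1 - R(x)."""
    return 1.0 - reliability

def field_collapse_index(global_score: float,
                         subfield_scores: Dict[str, float]) -> float:
    """Assumed form: gap between global performance and the weakest
    subfield in the explicit Subfield Stack."""
    return global_score - min(subfield_scores.values())

def inversion_rate(baseline: Dict[str, float],
                   candidate: Dict[str, float],
                   critical: List[str]) -> float:
    """Assumed form: fraction of critical subfields where the candidate
    regresses below the baseline (a ranking reversal), even when the
    candidate's global score improved."""
    if not critical:
        return 0.0
    flips = sum(1 for s in critical if candidate[s] < baseline[s])
    return flips / len(critical)

def syntactic_reliance_score(score_original: float,
                             score_syntax_perturbed: float) -> float:
    """Assumed form: score lost when interface syntax (templates, labels,
    tool/prompt formatting) is perturbed while task semantics are fixed."""
    return max(0.0, score_original - score_syntax_perturbed)
```

For example, a candidate that lifts the global score while regressing on an intensive-care subfield (subfield names hypothetical) shows a nonzero IR and can be held or blocked by a gate:

```python
baseline  = {"icu_notes": 0.78, "pediatrics": 0.81, "billing": 0.90}
candidate = {"icu_notes": 0.71, "pediatrics": 0.84, "billing": 0.95}
print(inversion_rate(baseline, candidate,
                     critical=["icu_notes", "pediatrics"]))  # 0.5
```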
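The evaluation receipt can likewise be rendered as a small typed record. The sketch below is a hypothetical rendering: the paper specifies the actual schema, and these field names and types are assumptions chosen only to cover the six recorded elements (field definitions, subfield rules, metrics, uncertainty bounds, derived collapse measures, gate outcomes).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SubfieldRule:
    name: str        # subfield identifier, e.g. "icu_notes" (hypothetical)
    predicate: str   # rule assigning examples to this subfield
    critical: bool   # whether a regression here should gate promotion

@dataclass
class EvaluationReceipt:
    field_definition: str                        # which LLF is being evaluated
    subfield_stack: List[SubfieldRule]           # the explicit Subfield Stack
    metrics: Dict[str, float]                    # global and per-subfield scores
    uncertainty: Dict[str, Tuple[float, float]]  # e.g. interval bounds per metric
    collapse_measures: Dict[str, float]          # derived FCI, IR, SRS values
    gate_outcome: str                            # "promote", "hold", or "block"
```

Recording such a receipt alongside each promotion decision is what makes the gate outcome replayable and contestable: the same subfield rules and metrics can be re-evaluated later, and the decision audited against them.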

Files

Hospitals, Helpdesks, Codebases, and Beyond: One LLM Collapse Law Across Domains.pdf