ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models
Authors/Creators
Description
This preprint introduces ERRORQUAKE, a benchmark and analysis framework for measuring not only whether large language models make
errors, but how severe those errors are. The study evaluates 21 open-weight LLMs on 10,000 queries across 8 domains and 5
difficulty tiers using a continuous 9-level error severity scale.
ERRORQUAKE models error severity distributions as heavy-tailed phenomena, drawing on a Gutenberg-Richter-style tail index to
characterize how often models produce high-severity failures. The paper reports that models with similar aggregate error rates can
differ substantially in the severity profile of their mistakes, including matched-accuracy model pairs with disjoint confidence
intervals for the severity distribution index.
The work contributes a severity-aware evaluation paradigm for open-weight LLMs, distributional evidence that scalar accuracy can
obscure important differences in model risk, and robustness analyses including bootstrap confidence intervals, sensitivity checks,
and human-audit validation.
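To make the tail-index idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes the Gutenberg-Richter relation log10 N(≥s) = a − b·s applied to severity scores s, estimates b with the standard Aki-Utsu maximum-likelihood formula, and attaches a percentile bootstrap confidence interval. The function names, the threshold `s_min`, and the synthetic data are all hypothetical; the estimator ERRORQUAKE actually uses may differ.

```python
import math
import random


def gr_b_value(severities, s_min, bin_width=0.0):
    """Aki-Utsu maximum-likelihood b-value for scores at or above s_min.

    bin_width is the width of the severity bins (0.0 for continuous scores).
    """
    tail = [s for s in severities if s >= s_min]
    if len(tail) < 2:
        raise ValueError("need at least two scores at or above s_min")
    mean_excess = sum(tail) / len(tail) - (s_min - bin_width / 2.0)
    if mean_excess <= 0:
        raise ValueError("mean excess over s_min must be positive")
    return math.log10(math.e) / mean_excess


def bootstrap_ci(severities, s_min, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the b-value."""
    rng = random.Random(seed)
    n = len(severities)
    estimates = []
    for _ in range(n_boot):
        resample = [severities[rng.randrange(n)] for _ in range(n)]
        try:
            estimates.append(gr_b_value(resample, s_min))
        except ValueError:
            continue  # skip degenerate resamples
    if not estimates:
        raise ValueError("no valid bootstrap resamples")
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[max(int((1 - alpha / 2) * len(estimates)) - 1, 0)]
    return lower, upper


def intervals_disjoint(ci_a, ci_b):
    """True when two (lower, upper) intervals do not overlap."""
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]


def synthetic_scores(b, s_min, n, seed):
    """Hypothetical data: exponential excess over s_min with a known b.

    These scores are synthetic and are not capped to any 9-level scale.
    """
    rng = random.Random(seed)
    rate = b * math.log(10)
    return [s_min + rng.expovariate(rate) for _ in range(n)]


# Two made-up "models" with different tail behaviour.
scores_a = synthetic_scores(b=1.0, s_min=1.0, n=500, seed=1)
scores_b = synthetic_scores(b=0.5, s_min=1.0, n=500, seed=2)

b_a = gr_b_value(scores_a, s_min=1.0)
b_b = gr_b_value(scores_b, s_min=1.0)
ci_a = bootstrap_ci(scores_a, s_min=1.0)
ci_b = bootstrap_ci(scores_b, s_min=1.0)

print(f"model A: b = {b_a:.2f}, 95% CI = ({ci_a[0]:.2f}, {ci_a[1]:.2f})")
print(f"model B: b = {b_b:.2f}, 95% CI = ({ci_b[0]:.2f}, {ci_b[1]:.2f})")
print("intervals disjoint:", intervals_disjoint(ci_a, ci_b))
```

Under this convention a smaller b means a heavier tail, so the model with the lower index produces high-severity errors relatively more often. Two models can therefore have similar overall error rates while still having confidence intervals for b that do not overlap.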
Files
| Name | Size |
|---|---|
| errorquake.pdf (md5:b5055b44e44a54588fd19956dd8cbb47) | 853.5 kB |
Additional details
Software
- Programming language: Python