Published June 2, 2026 | Version v1
Preprint | Open access

ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models

Authors/Creators

Description

  This preprint introduces ERRORQUAKE, a benchmark and analysis framework that measures not only whether large language models make
  errors, but also how severe those errors are. The study evaluates 21 open-weight LLMs on 10,000 queries spanning 8 domains and 5
  difficulty tiers, scoring each error on a continuous 9-level severity scale.

  ERRORQUAKE treats error severity as a heavy-tailed quantity and uses a Gutenberg-Richter-style tail index to characterize how
  often a model produces high-severity failures. The paper reports that models with similar aggregate error rates can differ
  substantially in the severity profile of their mistakes: some model pairs matched on accuracy have non-overlapping confidence
  intervals for the tail index.
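
To make the idea of a Gutenberg-Richter-style tail index concrete, the following is a minimal sketch, not the paper's implementation. It assumes the classical relation log10 N(>=s) = a - b*s and uses the Aki-Utsu maximum-likelihood estimate of b for binned data. The function name `gr_tail_index`, the threshold `s_min`, the bin width, and the two severity lists are hypothetical illustrations; the estimator actually used in ERRORQUAKE is not specified in this description.

```python
import math

def gr_tail_index(severities, s_min=1.0, bin_width=1.0):
    """Aki-Utsu maximum-likelihood estimate of b in log10 N(>=s) = a - b*s.

    severities: severity scores of a model's errors (hypothetical data).
    s_min:      lowest severity included in the fit (an assumption).
    bin_width:  spacing of the severity levels; use 0.0 for continuous scores.
    """
    tail = [s for s in severities if s >= s_min]
    if not tail:
        raise ValueError("no errors at or above s_min")
    # Mean excess over the lower edge of the first bin.
    excess = sum(tail) / len(tail) - (s_min - bin_width / 2.0)
    if excess <= 0:
        raise ValueError("mean severity must exceed the threshold")
    return math.log10(math.e) / excess

# Two hypothetical models with the same number of errors (same error rate)
# but different severity profiles. These values are made up for illustration.
model_a = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
model_b = [1, 1, 2, 2, 3, 4, 5, 6, 7, 9]

b_a = gr_tail_index(model_a)
b_b = gr_tail_index(model_b)
print(f"model_a b = {b_a:.3f}, model_b b = {b_b:.3f}")
```

Under this convention a smaller b means a heavier tail, so the second hypothetical model produces severe errors more often even though both lists contain the same number of errors.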

  The work contributes a severity-aware evaluation paradigm for open-weight LLMs, distributional evidence that scalar accuracy can
  obscure important differences in model risk, and robustness analyses including bootstrap confidence intervals, sensitivity checks,
  and human-audit validation.
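
The bootstrap confidence intervals mentioned above could be computed along the following lines. This is a sketch under stated assumptions, not the paper's procedure: it uses a plain percentile bootstrap over per-error severity scores, an Aki-Utsu style tail index as the statistic, and hypothetical names (`tail_index`, `bootstrap_ci`) and data. The resampling scheme, number of replicates, and interval type used in ERRORQUAKE are not given in this description.

```python
import math
import random

def tail_index(severities, s_min=1.0, bin_width=1.0):
    """Aki-Utsu style estimate of b in log10 N(>=s) = a - b*s (assumed statistic)."""
    tail = [s for s in severities if s >= s_min]
    if not tail:
        raise ValueError("no errors at or above s_min")
    excess = sum(tail) / len(tail) - (s_min - bin_width / 2.0)
    if excess <= 0:
        raise ValueError("mean severity must exceed the threshold")
    return math.log10(math.e) / excess

def bootstrap_ci(severities, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the tail index."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(severities)
    estimates = []
    for _ in range(n_boot):
        # Resample the errors with replacement and re-estimate the index.
        sample = rng.choices(severities, k=n)
        estimates.append(tail_index(sample))
    estimates.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical severity scores for one model's errors.
scores = [1, 1, 2, 2, 3, 4, 5, 6, 7, 9]
point = tail_index(scores)
lo, hi = bootstrap_ci(scores)
print(f"b = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Two models would then be said to have distinguishable severity profiles when their intervals do not overlap, which is the kind of comparison the description refers to for matched-accuracy pairs.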

Files

errorquake.pdf (853.5 kB)
md5:b5055b44e44a54588fd19956dd8cbb47

Additional details

Software

Programming language
Python