Training Data Velocity Bias: How Virality-Optimized Corpora May Explain Persistent Hallucination in Large Language Models
Description
This paper proposes that persistent hallucination in large language models may result from systematic velocity bias in training data. Building on empirical findings that false information spreads significantly faster than truth in social media networks (Vosoughi et al., 2018), we argue that web-scale training corpora systematically overrepresent high-velocity, low-accuracy content due to platform virality optimization. We formalize velocity bias as the overrepresentation of high-virality content relative to high-epistemic-quality content, present a theoretical framework for analyzing velocity-accuracy anticorrelation, propose controlled experiments to test causality, and discuss AI safety implications including feedback loops as AI-generated content enters future training corpora. The paper includes proposed interventions such as velocity-aware data sampling and temporal weighting strategies. A companion paper presents Vital Network Science, a comprehensive governance framework for addressing velocity bias at the information ecosystem level.
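To make the proposed interventions concrete, the following is a minimal sketch, not the authors' actual method, of how velocity-aware sampling combined with temporal weighting might assign corpus-inclusion probabilities. The function names, the `virality` and `age_days` fields, and the `velocity_penalty` and `temporal_half_life` parameters are all hypothetical illustrations.

```python
import math
import random

def epistemic_sampling_weight(virality_score, age_days,
                              velocity_penalty=1.0, temporal_half_life=365.0):
    """Weight that downweights high-velocity content and upweights content
    that has persisted over time (hypothetical scoring scheme).

    virality_score: normalized spread velocity in [0, 1] (assumed metric)
    age_days: document age at corpus-assembly time
    """
    # Exponentially penalize high-velocity (virality-optimized) content.
    velocity_term = math.exp(-velocity_penalty * virality_score)
    # Temporal weighting: older, persistent content approaches full weight,
    # saturating with a configurable half-life.
    temporal_term = 1.0 - math.exp(-age_days * math.log(2) / temporal_half_life)
    return velocity_term * temporal_term

def sample_corpus(documents, k, rng=random.Random(0)):
    """Draw k documents with probability proportional to their weight."""
    weights = [epistemic_sampling_weight(d["virality"], d["age_days"])
               for d in documents]
    return rng.choices(documents, weights=weights, k=k)

# Example: a viral one-day-old post vs. a low-velocity year-old article.
docs = [
    {"id": "viral_post", "virality": 0.9, "age_days": 1},
    {"id": "slow_article", "virality": 0.1, "age_days": 400},
]
print(sample_corpus(docs, k=5))
```

Under these assumptions the slow, older article dominates the sample, illustrating how velocity-aware sampling would shift corpus composition toward lower-velocity content.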
Files
| Name | Size |
|---|---|
| Paper1_Publication_Ready (1).pdf (md5:f4c14505a94465412f1e4b1e8950a035) | 275.5 kB |
Additional details
Related works
- Is supplemented by: Preprint, DOI 10.5281/zenodo.17459843