Published October 27, 2025 | Version v1
Preprint | Open Access

Training Data Velocity Bias: How Virality-Optimized Corpora May Explain Persistent Hallucination in Large Language Models

Creators

Description

This paper proposes that persistent hallucination in large language models may result from systematic velocity bias in training data. Building on empirical findings that false information spreads significantly faster than truth in social media networks (Vosoughi et al., 2018), we argue that web-scale training corpora overrepresent high-velocity, low-accuracy content as a consequence of platform virality optimization. We formalize velocity bias as the overrepresentation of high-virality content relative to high-epistemic-quality content, present a theoretical framework for analyzing the anticorrelation between velocity and accuracy, propose controlled experiments to test causality, and discuss AI safety implications, including feedback loops that arise as AI-generated content enters future training corpora. The paper also proposes interventions such as velocity-aware data sampling and temporal weighting strategies. A companion paper presents Vital Network Science, a comprehensive governance framework for addressing velocity bias at the level of the information ecosystem.
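
As a rough illustration only, the sketch below shows one way velocity-aware data sampling could be realized: documents are drawn with probability that decays with their sharing velocity and grows with an estimated epistemic-quality score. The field names (share_velocity, quality_score), the exponential penalty, and all numeric values are assumptions made for this sketch and are not taken from the paper.

```python
import math
import random

# Hypothetical sketch of velocity-aware data sampling (illustrative, not the paper's method).
# Each document carries an assumed sharing-velocity estimate and an assumed
# epistemic-quality score; sampling weight falls as velocity rises and rises
# with quality, dampening high-virality, low-accuracy content.

def sampling_weight(share_velocity, quality_score, alpha=1.0, beta=1.0):
    """Return an unnormalized sampling weight for one training document.

    share_velocity: assumed measure of how fast the item spread (e.g. shares/hour)
    quality_score:  assumed epistemic-quality estimate in [0, 1]
    alpha, beta:    strength of the velocity penalty and the quality bonus
    """
    return math.exp(-alpha * share_velocity) * (quality_score ** beta)

def sample_corpus(documents, k, rng=random):
    """Draw k documents with probability proportional to their weights."""
    weights = [sampling_weight(d["share_velocity"], d["quality_score"]) for d in documents]
    return rng.choices(documents, weights=weights, k=k)

# Toy example (all values invented for illustration).
docs = [
    {"text": "viral rumor", "share_velocity": 5.0, "quality_score": 0.2},
    {"text": "slow but well-sourced report", "share_velocity": 0.3, "quality_score": 0.9},
]
print(sample_corpus(docs, k=1))
```

Under these assumptions, the slow, well-sourced document receives a far larger weight than the viral rumor; a temporal weighting strategy could be layered on top by making alpha a function of publication date.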

Files

Paper1_Publication_Ready (1).pdf (275.5 kB)
md5:f4c14505a94465412f1e4b1e8950a035

Additional details

Related works

Is supplemented by
Preprint: 10.5281/zenodo.17459843 (DOI)