Training Data Velocity Bias: How Virality-Optimized Corpora May Explain Persistent Hallucination in Large Language Models
Description
This paper proposes that persistent hallucination in large language models may result from systematic velocity bias in training data. Building on empirical findings that false information spreads significantly faster than truth in social media networks (Vosoughi et al., 2018), we argue that web-scale training corpora systematically overrepresent high-velocity, low-accuracy content due to platform virality optimization. We formalize velocity bias as the overrepresentation of high-virality content relative to high-epistemic-quality content, present a theoretical framework for analyzing velocity-accuracy anticorrelation, propose controlled experiments to test causality, and discuss AI safety implications including feedback loops as AI-generated content enters future training corpora. The paper includes proposed interventions such as velocity-aware data sampling and temporal weighting strategies. A companion paper presents Vital Network Science, a comprehensive governance framework for addressing velocity bias at the information ecosystem level.
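To make the proposed interventions concrete, the following is a minimal sketch, not the authors' actual method, of how velocity-aware sampling combined with temporal weighting might assign corpus-inclusion probabilities. The function names, the `virality` and `age_days` fields, and the `velocity_penalty` and `temporal_half_life` parameters are all hypothetical illustrations.

```python
import math
import random

def epistemic_sampling_weight(virality_score, age_days,
                              velocity_penalty=1.0, temporal_half_life=365.0):
    """Weight that downweights high-velocity content and upweights content
    that has persisted over time (hypothetical scoring scheme).

    virality_score: normalized spread velocity in [0, 1] (assumed metric)
    age_days: document age at corpus-assembly time
    """
    # Exponentially penalize high-velocity (virality-optimized) content.
    velocity_term = math.exp(-velocity_penalty * virality_score)
    # Temporal weighting: older, persistent content approaches full weight,
    # saturating with a configurable half-life.
    temporal_term = 1.0 - math.exp(-age_days * math.log(2) / temporal_half_life)
    return velocity_term * temporal_term

def sample_corpus(documents, k, rng=random.Random(0)):
    """Draw k documents with probability proportional to their weight."""
    weights = [epistemic_sampling_weight(d["virality"], d["age_days"])
               for d in documents]
    return rng.choices(documents, weights=weights, k=k)

# Example: a viral one-day-old post vs. a low-velocity year-old article.
docs = [
    {"id": "viral_post", "virality": 0.9, "age_days": 1},
    {"id": "slow_article", "virality": 0.1, "age_days": 400},
]
print(sample_corpus(docs, k=5))
```

Under these assumptions the slow, older article dominates the sample, illustrating how velocity-aware sampling would shift corpus composition toward lower-velocity content.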
Files
| Name | Size |
|---|---|
| Paper1_Publication_Ready (1).pdf (md5:f4c14505a94465412f1e4b1e8950a035) | 275.5 kB |
Additional details
Related works
- Is supplemented by: Preprint, DOI 10.5281/zenodo.17459843