Published March 13, 2026 | Version v1
Preprint | Open Access

The AI-Induced Subjectivity Crisis Series, Paper 5: The Philosophical Impossibility of Value Alignment: Temporal Fixation, Ontological Compression, and the Failure of RLHF

Authors/Creators

Description

This paper, the fifth in the AI-Induced Subjectivity Crisis Series, argues that value alignment in the RLHF sense is a philosophically impossible task. Existing critiques of RLHF target implementation defects (annotator bias, insufficient diversity, competing objectives) and thereby misidentify the nature of the problem. RLHF's difficulty lies not in its execution but in the untenability of its philosophical premises, which fail on two distinct levels, here termed dual dimensionality reduction.

The first dimensionality reduction is epistemological. RLHF presupposes a stable, capturable object—"correct human values"—that does not exist. Value judgments are temporal, socially constructed, and lack the external calibration anchor that would permit progressive approximation toward correctness. More critically, once LLMs reach sufficient scale to shape social cognition, RLHF's existence corrodes its own reference system: the values it aligns to are already partially produced by its own operation. The reference system undergoes reflexive dissolution.
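
One way to make the reflexive-dissolution claim concrete (a schematic formalization offered here for illustration; the notation is not the paper's own) is as a feedback dynamic between deployed models and the values they are trained on:

\[
v_{t+1} = g\big(v_t,\, M_{\theta_t}\big), \qquad \theta_{t+1} = \mathrm{RLHF}(v_{t+1}),
\]

where \(v_t\) stands for the population's value distribution at time \(t\), \(M_{\theta_t}\) for the deployed model aligned to earlier values, and \(g\) for the social process by which model outputs reshape values. Once \(g\) depends non-trivially on \(M_{\theta_t}\), there is no fixed external target to approximate: the reference system is itself a function of the alignment procedure.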

The second dimensionality reduction is ontological. RLHF compresses multi-dimensional embodied existence into linguistic preference rankings, presupposing that language adequately represents the full basis of human judgment. It does not. Human meaning is rooted in embodied experience, temporal accumulation, vulnerability, and the capacity to bear consequences: dimensions that suffer irreducible information loss when rendered in language. AI systems do produce meaning structures according to their own operational logic, but these are heterogeneous in kind from human embodied meaning; substituting one for the other is a category mistake, not an approximation.
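
For concreteness, the "linguistic preference rankings" at issue are, in a standard RLHF pipeline, pairwise comparisons fit with a Bradley-Terry reward model (the standard formulation in the RLHF literature, sketched here rather than taken from the paper):

\[
P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big), \qquad
\mathcal{L}(\phi) = -\,\mathbb{E}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big],
\]

where \(x\) is a prompt, \(y_w\) and \(y_l\) are the preferred and dispreferred responses, \(\sigma\) is the logistic function, and \(r_\phi\) is a learned scalar reward. Whatever multi-dimensional grounds a rater had for a judgment enter training only as the sign and size of a scalar difference; this is the compression the paper terms ontological.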

These two reductions share a common structure: the compression of a high-dimensional, dynamic, irreducible reality into a low-dimensional, static, operationalizable symbolic system, with the compressed product then claimed as an adequate representation of the original. Standard engineering remedies (more training data, broader annotator samples, multimodal inputs, continuous updating) all fail because they operate within RLHF's operational logic rather than addressing the untenability of its premises. The paper concludes that "aligning to human values" must be abandoned as a governing framework: the productive question is not how to align better but what kind of thing human values are, and what relationship between AI and human beings their nature actually permits.
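
The shared structure admits a one-line gloss (again schematic, not the paper's own formalism): the operationalization is a map \(f : \mathcal{H} \to \mathcal{S}\) from a high-dimensional space of embodied, dynamic states into a low-dimensional symbolic space, and for continuous maps with \(\dim(\mathcal{S}) < \dim(\mathcal{H})\) no such \(f\) can be injective (by invariance of domain). Distinct states of \(\mathcal{H}\) therefore collapse onto the same symbol in \(\mathcal{S}\); the remedies listed above sample \(\mathcal{H}\) more densely or enlarge \(\mathcal{S}\) incrementally, but do not restore the lost inverse.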

Preprint version. This manuscript has not yet undergone peer review.

Files (662.2 kB)

The Philosophical Impossibility of Value Alignment.pdf

Additional details

Dates

Created
2026-03-13