Published April 28, 2026 | Version v1
Publication · Open

Catastrophic Forgetting in Continual RLHF: A Measurement Framework for Round-Over-Round Capability Degradation

Authors/Creators

Description

Reinforcement learning from human feedback (RLHF) is widely understood to incur an alignment tax: aligning a language
model with human preferences can degrade capabilities the base model possessed. This phenomenon is well documented in
single-round comparisons. What is not well documented, despite being the actual production setting, is the cumulative
degradation across multiple rounds of RLHF — the iterated case in which preference data is collected, a reward model is
retrained, and the policy is updated repeatedly. This paper argues that the existing alignment-tax literature, while valuable,
leaves five distinct measurement gaps unaddressed: round-over-round longitudinal dynamics, capability-stratified rather
than aggregate degradation, systematic comparison across RLHF algorithms, long-tail and rare-capability decay, and
mechanistic understanding of why specific components forget. We propose a measurement framework targeting each of
these gaps and a concrete experimental protocol — a multi-round RLHF study on an open base model with capability-decomposed evaluation — that academic teams could execute today. We argue that this is one of the higher-leverage open
problems in alignment evaluation: the relevant techniques exist, the cost is moderate, the production relevance is high, and
the empirical baseline is genuinely thin.
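The round-over-round, capability-stratified measurement the abstract describes can be sketched as a simple bookkeeping loop: score each capability after every RLHF round and report deltas against the base model rather than a single aggregate. The sketch below is illustrative only — the capability names, score values, and function name are assumptions, not the paper's actual protocol or code.

```python
# Hedged sketch of capability-stratified round-over-round degradation
# tracking. All capability names and scores below are hypothetical.

def degradation_by_round(scores_by_round):
    """scores_by_round: list of {capability: score} dicts, where index 0
    is the base model and index r is the policy after RLHF round r.
    Returns, for each round, the per-capability delta vs. the base."""
    base = scores_by_round[0]
    return [
        {cap: round_scores[cap] - base[cap] for cap in base}
        for round_scores in scores_by_round[1:]
    ]

# Illustrative numbers: aggregate decay looks mild, but the rare
# capability ("rare_lang") collapses — the long-tail gap the paper names.
rounds = [
    {"math": 0.62, "code": 0.55, "rare_lang": 0.40},  # base model
    {"math": 0.60, "code": 0.54, "rare_lang": 0.31},  # after round 1
    {"math": 0.57, "code": 0.53, "rare_lang": 0.22},  # after round 2
]
for r, deltas in enumerate(degradation_by_round(rounds), start=1):
    print(f"round {r}: {deltas}")
```

Stratifying this way is the point: averaging the three capabilities above would hide that nearly all of the loss is concentrated in the rare one.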

Files (91.8 kB)

Catastrophic_Forgetting_Continual_RLHF_Mahendrakar.pdf

Additional details

Dates

Accepted
2026