Catastrophic Forgetting in Continual RLHF: A Measurement Framework for Round-Over-Round Capability Degradation
Authors/Creators
Description
Reinforcement learning from human feedback (RLHF) is widely understood to incur an alignment tax: aligning a language
model with human preferences can degrade capabilities the base model possessed. This phenomenon is well documented in
single-round comparisons. What is not well documented, even though the iterated case is the actual production setting, is the
cumulative degradation across multiple rounds of RLHF, in which preference data is collected, a reward model is retrained,
and the policy is updated repeatedly. This paper argues that the existing alignment-tax literature, while valuable,
leaves five distinct measurement gaps unaddressed: round-over-round longitudinal dynamics, capability-stratified rather
than aggregate degradation, systematic comparison across RLHF algorithms, long-tail and rare-capability decay, and
mechanistic understanding of why specific components forget. We propose a measurement framework targeting each of
these gaps and a concrete experimental protocol — a multi-round RLHF study on an open base model with capability-decomposed evaluation — that academic teams could execute today. We argue that this is one of the higher-leverage open
problems in alignment evaluation: the relevant techniques exist, the cost is moderate, the production relevance is high, and
the empirical baseline is genuinely thin.
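As a rough illustration of the capability-stratified, round-over-round measurement the framework calls for, the minimal Python sketch below tabulates per-capability score deltas against the base model after each RLHF round. All names here (`evaluate`, `CAPABILITIES`, the checkpoint list) are illustrative assumptions for exposition, not an interface defined in the paper.

```python
# Sketch of round-over-round, capability-stratified degradation tracking.
# Assumes an external `evaluate(checkpoint, capability) -> score` function;
# the capability slices below are hypothetical examples.

from typing import Callable, Dict, List

CAPABILITIES = ["math", "coding", "factual_recall", "long_tail_trivia"]

def degradation_table(
    checkpoints: List[str],                 # base model first, then one checkpoint per RLHF round
    evaluate: Callable[[str, str], float],  # (checkpoint, capability) -> accuracy in [0, 1]
) -> Dict[str, List[float]]:
    """Per-capability score deltas vs. the base model, one entry per round."""
    base = {cap: evaluate(checkpoints[0], cap) for cap in CAPABILITIES}
    table: Dict[str, List[float]] = {cap: [] for cap in CAPABILITIES}
    for ckpt in checkpoints[1:]:
        for cap in CAPABILITIES:
            table[cap].append(evaluate(ckpt, cap) - base[cap])
    return table
```

Measuring each round against the base model, rather than against the previous round's checkpoint, keeps cumulative drift separate from round-to-round noise, which is the longitudinal signal the abstract identifies as missing from single-round comparisons.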
Files

| Name | Size |
|---|---|
| Catastrophic_Forgetting_Continual_RLHF_Mahendrakar.pdf (md5:96871111545f11c047e0bfa95c6c6da3) | 91.8 kB |
Additional details

Dates
- Accepted: 2026