Catastrophic Forgetting in Continual RLHF: A Measurement Framework for Round-Over-Round Capability Degradation
Authors/Creators
Description
Reinforcement learning from human feedback (RLHF) is widely understood to incur an alignment tax: aligning a language
model with human preferences can degrade capabilities the base model possessed. This phenomenon is well documented in
single-round comparisons. What is not well documented, even though the iterated case is the actual production setting, is the
cumulative degradation across multiple rounds of RLHF, in which preference data is collected, a reward model is retrained,
and the policy is updated repeatedly. This paper argues that the existing alignment-tax literature, while valuable,
leaves five distinct measurement gaps unaddressed: round-over-round longitudinal dynamics, capability-stratified rather
than aggregate degradation, systematic comparison across RLHF algorithms, long-tail and rare-capability decay, and
mechanistic understanding of why specific components forget. We propose a measurement framework targeting each of
these gaps and a concrete experimental protocol — a multi-round RLHF study on an open base model with capability-decomposed evaluation — that academic teams could execute today. We argue that this is one of the higher-leverage open
problems in alignment evaluation: the relevant techniques exist, the cost is moderate, the production relevance is high, and
the empirical baseline is genuinely thin.
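As a rough illustration of the capability-stratified, round-over-round measurement the framework calls for, the minimal Python sketch below tabulates per-capability score deltas against the base model after each RLHF round. All names here (`evaluate`, `CAPABILITIES`, the checkpoint list) are illustrative assumptions for exposition, not an interface defined in the paper.

```python
# Sketch of round-over-round, capability-stratified degradation tracking.
# Assumes an external `evaluate(checkpoint, capability) -> score` function;
# the capability slices below are hypothetical examples.

from typing import Callable, Dict, List

CAPABILITIES = ["math", "coding", "factual_recall", "long_tail_trivia"]

def degradation_table(
    checkpoints: List[str],                 # base model first, then one checkpoint per RLHF round
    evaluate: Callable[[str, str], float],  # (checkpoint, capability) -> accuracy in [0, 1]
) -> Dict[str, List[float]]:
    """Per-capability score deltas vs. the base model, one entry per round."""
    base = {cap: evaluate(checkpoints[0], cap) for cap in CAPABILITIES}
    table: Dict[str, List[float]] = {cap: [] for cap in CAPABILITIES}
    for ckpt in checkpoints[1:]:
        for cap in CAPABILITIES:
            table[cap].append(evaluate(ckpt, cap) - base[cap])
    return table
```

Measuring each round against the base model, rather than against the previous round's checkpoint, keeps cumulative drift separate from round-to-round noise, which is the longitudinal signal the abstract identifies as missing from single-round comparisons.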
Files

| Name | Size |
|---|---|
| Catastrophic_Forgetting_Continual_RLHF_Mahendrakar.pdf (md5:96871111545f11c047e0bfa95c6c6da3) | 91.8 kB |
Additional details

Dates
- Accepted: 2026