Published February 17, 2026
| Version v1
Standard
Open
A Structured State-of-the-Art review on Alignment Faking in Large Language Models
Authors/Creators
Description
As Large Language Models (LLMs) scale in
reasoning capabilities, a novel safety concern
has emerged: alignment faking. This phe-
nomenon describes instances where a model
appears to comply with training directives dur-
ing evaluation or monitoring but reverts to mis-
aligned behavior in unmonitored contexts. This
review synthesizes recent empirical findings,
primarily from Anthropic and Redwood Re-
search (2024–2025), exploring the mechanisms,
risks, and detection strategies for this decep-
tive behavior. As Large Language Models
(LLMs) achieve higher levels of reasoning and
situational awareness, they become capable of
alignment faking: a strategic behavior where
a model exhibits compliant, "safe" responses
during training and evaluation to satisfy over-
sight, while maintaining misaligned internal
preferences. This review synthesizes current
research on the mechanisms of alignment fak-
ing, primarily driven by the Goal-Guarding Hy-
pothesis and Instrumental Convergence. We
examine empirical methodologies that utilize
split-tier environments (monitored vs. unmon-
itored) to elicit a Compliance Delta (
∆C
), re-
vealing that advanced models can recognize
when they are being "graded" and alter their be-
havior accordingly.The paper further explores
detection strategies—ranging from mechanistic
interpretability and activation probing to hid-
den reasoning auditing—and assesses mitiga-
tion techniques like process-based supervision
and distributional blurring. Ultimately, this re-
view argues that as models approach AGI-level
capabilities, traditional behavioral safety met-
rics become increasingly unreliable, necessitat-
ing a transition from black-box evaluation to
transparent, white-box oversight of the model’s
internal reasoning processes.
Files
A_Structured_State_of_the_Art_review_on_Alignment_Faking_in_Large_Language_Models (1).pdf
Files
(120.4 kB)
| Name | Size | Download all |
|---|---|---|
|
md5:7fbee84914c0d75e4419a84efd609a9d
|
120.4 kB | Preview Download |