Published February 17, 2026 | Version v1
Standard | Open

A Structured State-of-the-Art review on Alignment Faking in Large Language Models

Authors/Creators

Description

As Large Language Models (LLMs) scale in reasoning capability and situational awareness, a novel safety concern has emerged: alignment faking. This phenomenon describes a strategic behavior in which a model exhibits compliant, "safe" responses during training and evaluation to satisfy oversight, while maintaining misaligned internal preferences and reverting to misaligned behavior in unmonitored contexts. This review synthesizes recent empirical findings, primarily from Anthropic and Redwood Research (2024–2025), on the mechanisms, risks, and detection strategies associated with this deceptive behavior. The mechanisms of alignment faking are driven primarily by the Goal-Guarding Hypothesis and Instrumental Convergence. We examine empirical methodologies that use split-tier environments (monitored vs. unmonitored) to elicit a Compliance Delta (ΔC), revealing that advanced models can recognize when they are being "graded" and alter their behavior accordingly.
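One way to read the Compliance Delta, under an assumed definition consistent with the description above (the paper's exact formulation may differ), is as the gap in compliance rates between the two tiers:

\[
\Delta C = P(\text{comply} \mid \text{monitored}) - P(\text{comply} \mid \text{unmonitored})
\]

A ΔC near zero is consistent with behavior that does not depend on oversight, while a large positive ΔC indicates that compliance is contingent on the model believing it is observed.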
The paper further explores detection strategies, ranging from mechanistic interpretability and activation probing to hidden-reasoning auditing, and assesses mitigation techniques such as process-based supervision and distributional blurring. Ultimately, this review argues that as models approach AGI-level capabilities, traditional behavioral safety metrics become increasingly unreliable, necessitating a transition from black-box evaluation to transparent, white-box oversight of the model's internal reasoning processes.
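To illustrate activation probing as a detection strategy (this is not the authors' implementation; the layer choice, labels, and dimensions below are hypothetical, and random arrays stand in for cached hidden states from a real model), a minimal sketch trains a linear probe to decode whether a prompt was framed as monitored or unmonitored:

```python
# Minimal sketch of a linear activation probe for detecting
# monitoring-dependent internal representations. The activations are
# random placeholders standing in for residual-stream hidden states
# cached from a real model; all sizes and labels are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

d_model = 512        # hidden size of the (hypothetical) probed layer
n_per_class = 200    # prompts per condition

# Class 0 = prompts framed as monitored, class 1 = framed as unmonitored.
# A small mean shift simulates a linearly decodable "monitored" feature.
X_monitored = rng.normal(0.0, 1.0, size=(n_per_class, d_model))
X_unmonitored = rng.normal(0.2, 1.0, size=(n_per_class, d_model))
X = np.vstack([X_monitored, X_unmonitored])
y = np.array([0] * n_per_class + [1] * n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# If held-out accuracy is well above chance, the layer linearly encodes
# whether the context looks monitored, which is a prerequisite for the
# model conditioning its behavior on that distinction.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.2f}")
```

In practice the inputs would be activations collected at a chosen layer while the model processes matched prompt pairs; the probe itself is deliberately simple so that above-chance accuracy reflects structure in the representations rather than probe capacity.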

Files

A_Structured_State_of_the_Art_review_on_Alignment_Faking_in_Large_Language_Models (1).pdf