Published January 24, 2026 | Version 1.0.0
Preprint Open

Layer-Native Safety Clamping: Representation Engineering for Jailbreak-Resistant LLMs

  • Independent Researcher

Description

Large Language Models remain vulnerable to jailbreak attacks that bypass traditional safety measures. We propose Layer-Native Safety Clamping, a representation engineering approach that operates directly in the model's activation space. The method learns harm directions from contrastive safe/harmful pairs and clamps any activation whose projection onto those directions exceeds a learned threshold. Because the intervention acts on internal representations rather than on the prompt, it cannot be bypassed through prompt manipulation alone.
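The core mechanism described above can be sketched in a few lines. The snippet below is an illustrative NumPy implementation, not the authors' released code: the function names, the difference-of-means estimate of the harm direction, and the per-example projection clamp are all assumptions about one plausible realization of the described approach.

```python
import numpy as np

def learn_harm_direction(safe_acts: np.ndarray, harmful_acts: np.ndarray) -> np.ndarray:
    """Estimate a unit-norm harm direction as the difference of mean
    activations between harmful and safe examples (shape: (n, d) each)."""
    direction = harmful_acts.mean(axis=0) - safe_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def clamp_activations(acts: np.ndarray, direction: np.ndarray, threshold: float) -> np.ndarray:
    """Clamp the component of each activation along `direction` so its
    projection never exceeds `threshold`; activations below the
    threshold pass through unchanged."""
    proj = acts @ direction                        # (n,) projections onto harm direction
    excess = np.maximum(proj - threshold, 0.0)     # positive overshoot only
    return acts - excess[:, None] * direction[None, :]
```

In a transformer, `clamp_activations` would typically run inside a per-layer hook over the residual stream, with `threshold` calibrated per layer on held-out safe activations.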

We integrate this approach into INL (Inertial Neural Learning) dynamics and release a 10K-pair contrastive safety dataset. Code and dataset are available at: https://huggingface.co/datasets/Pacific-Prime/safety_dataset

Files

main.pdf (266.4 kB) — md5:9a8509e273f6b49dccecdd2d07128cca
