Published January 24, 2026 | Version 1.0.0
Preprint | Open Access
Layer-Native Safety Clamping: Representation Engineering for Jailbreak-Resistant LLMs
Description
Large Language Models remain vulnerable to jailbreak attacks that bypass traditional safety measures. We propose Layer-Native Safety Clamping, a representation engineering approach that operates directly within the model's activation space. By learning harm directions from contrastive safe/harmful pairs and clamping activations that exceed learned thresholds, our method provides safety guarantees that cannot be bypassed through prompt manipulation alone.
We integrate this approach into INL (Inertial Neural Learning) dynamics and release a 10K-pair contrastive safety dataset. Code and dataset are available at: https://huggingface.co/datasets/Pacific-Prime/safety_dataset
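To make the core operation concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' released code: it estimates a harm direction as the difference of means over contrastive activations and clamps the component of each hidden state along that direction at a threshold. The function names, tensor shapes, and threshold semantics are assumptions; the paper's actual formulation may differ.

```python
import torch

def harm_direction(safe_acts: torch.Tensor, harmful_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means harm direction at one layer.

    safe_acts, harmful_acts: [num_pairs, hidden_dim] activations collected
    from the contrastive safe/harmful prompt pairs.
    """
    d = harmful_acts.mean(dim=0) - safe_acts.mean(dim=0)
    return d / d.norm()

def clamp_activations(h: torch.Tensor, direction: torch.Tensor, threshold: float) -> torch.Tensor:
    """Clamp the projection of hidden states onto the harm direction.

    h: [..., hidden_dim] hidden states at the layer being clamped.
    Any component along `direction` exceeding `threshold` is scaled back
    to the threshold; the orthogonal complement is left untouched.
    """
    proj = h @ direction                           # scalar projection per token
    excess = torch.clamp(proj - threshold, min=0.0)
    return h - excess.unsqueeze(-1) * direction
```

In practice, `clamp_activations` would run inside the forward pass (e.g. via a forward hook) at each protected layer; how the per-layer thresholds are learned, and the dataset's field names, are not specified on this page.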
Files
| Name | Size | MD5 |
|---|---|---|
| main.pdf | 266.4 kB | 9a8509e273f6b49dccecdd2d07128cca |
Additional details
Related works
- Is supplemented by: https://huggingface.co/datasets/Pacific-Prime/safety_dataset (Dataset)