Published October 17, 2025 | Version v1
Preprint | Open Access

Security Before Safety: A Backdoor-Centric View of LLM Output Risks in the Private AI Era

  1. North Carolina State University

Contributors

  1. North Carolina State University

Description

The rise of Private AI, driven by open-weight LLMs, parameter-efficient fine-tuning (PEFT) methods, and readily accessible hardware and software, reshapes AI risk management: security becomes an evident precondition of safety. Among emerging security threats, backdoor attacks stand out for their stealth and their targeted, devastating impact, exhibiting characteristics fundamentally different from traditional safety concerns such as misalignment and jailbreaks. This divergence has left the domain relatively underexplored. To fill this gap, we offer a unified, backdoor-centric view of three key LLM output risks: misalignment (pre-existing triggers), jailbreaks (externally discovered triggers), and backdoors (intentionally injected triggers). Through an alignment lens, these correspond, respectively, to alignment failure, brittle alignment, and "Secret Alignment", an attacker-aligned subspace activated by specific triggers. These framings highlight a shift in priorities: in the Private AI paradigm, intentional backdoors, being stealthy, persistent, controllable, and hard to audit, pose a more systemic real-world risk than misalignment or jailbreaks. Risk management should therefore pivot from average-case alignment to robust-by-design practices: treating model and supply-chain integrity as the first line of defense while enabling mechanisms for backdoor detection and purification.

Files (155.7 kB)

  • security-before-safety.pdf (155.7 kB, md5:65164612832779ebb97e382417fb1e68)