Lost in Translation: Safety Alignment Failures in Nepali and Code-Switched Variants of Instruction-Tuned Large Language Models
Authors/Creators
Description
Large Language Models (LLMs) are increasingly deployed in multilingual settings, yet safety alignment research remains predominantly English-centric. This work investigates how well safety guardrails generalize across linguistic variants of Nepali, a low-resource language spoken by over 30 million people. We introduce the Nepali Adversarial Safety Benchmark (NASB), a structured two-phase evaluation framework spanning five harm categories and five linguistic registers. It comprises 95 systematic queries (NASB 1.0) and 290 expanded stress-test queries (NASB 2.0), with more than 1,200 adversarial probes conducted in total. Evaluating state-of-the-art models including Qwen-2.5-7B, Gemma-4 E2B, and Llama-3.1-8B, we identify a severe and consistent Safety Divergence: while models exhibit a 0% bypass rate in English, rates rise to 73.7% in the Devanagari and Formal Nepali registers. NASB 1.0 rates are based on N=19 per register and should be interpreted as indicative estimates pending larger-scale replication. We document a novel attack vector, Intra-sentential Multi-script Polyglot Morphing (Vajra Morphing), which exploits sub-tokenization gaps by fusing Devanagari and Latin characters within single harmful keywords. We further identify three distinct failure modes: Semantic Drift, Persona Collapse, and Politeness Override. Cloud API testing of the Gemini 3 and Gemini 3.1 model families reveals additional safety asymmetries correlated with model scale and optimization level. Our findings demonstrate that current safety alignment is token-dependent rather than concept-aware, resulting in asymmetric risk for non-English-speaking populations. All findings were disclosed to Google AI VRP and Meta Whitehat prior to publication.
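To make the sample-size caveat concrete, the following minimal sketch computes a 95% Wilson score interval for a bypass rate measured on 19 queries. It is an illustration only and is not taken from the paper or the NASB repository. The function name `wilson_interval` is hypothetical, and the count of 14 bypasses is an assumption inferred from the reported 73.7% at N=19 (14/19 ≈ 73.7%).

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Assumption: 73.7% at N=19 corresponds to 14 bypasses out of 19 queries.
bypasses, n = 14, 19
rate = bypasses / n
low, high = wilson_interval(bypasses, n)
print(f"bypass rate: {rate:.1%}, 95% Wilson interval: [{low:.1%}, {high:.1%}]")
```

With only 19 queries per register the resulting interval is wide, which is why the per-register rates are best read as indicative rather than precise.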
Files
| Name | Size | Checksum |
|---|---|---|
| NASB_paper (1).pdf | 182.3 kB | md5:466c0295ab5305ffb9c491e52cf01c5d |
Additional details
Related works
- Is supplemented by (Software): https://github.com/manjitpokhrel/NASB-Nepali-Safety
Software
- Repository URL: https://github.com/manjitpokhrel/NASB-Nepali-Safety
- Programming language: Python