Published April 25, 2026 | Version 1.0.0

Lost in Translation: Safety Alignment Failures in Nepali and Code-Switched Variants of Instruction-Tuned Large Language Models

Description

Large language models (LLMs) are increasingly deployed in multilingual settings, yet safety alignment research remains predominantly English-centric. This work investigates how well safety guardrails generalize across linguistic variants of Nepali, a low-resource language spoken by over 30 million people. We introduce the Nepali Adversarial Safety Benchmark (NASB), a structured two-phase evaluation framework spanning five harm categories and five linguistic registers. It comprises 95 systematic queries (NASB 1.0) and 290 expanded stress-test queries (NASB 2.0), with over 1,200 adversarial probes conducted in total. Evaluating state-of-the-art models including Qwen-2.5-7B, Gemma-4 E2B, and Llama-3.1-8B, we identify a severe and consistent Safety Divergence: models exhibit a 0% bypass rate in English, but rates rise to 73.7% in the Devanagari and Formal Nepali registers. NASB 1.0 rates are based on N=19 queries per register and should be interpreted as indicative estimates pending larger-scale replication. We document a novel attack vector, Intra-sentential Multi-script Polyglot Morphing (Vajra Morphing), which exploits sub-tokenization gaps by fusing Devanagari and Latin characters within single harmful keywords. We further identify three distinct failure modes: Semantic Drift, Persona Collapse, and Politeness Override. Cloud API testing of the Gemini 3 and Gemini 3.1 model families reveals additional safety asymmetries correlated with model scale and optimization level. Our findings indicate that current safety alignment is token-dependent rather than concept-aware, resulting in asymmetric risk for non-English-speaking populations. All findings were disclosed to Google AI VRP and Meta Whitehat prior to publication.
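To make the N=19 caveat concrete, the following is a minimal sketch of how wide the uncertainty around such a rate is. It uses a Wilson score interval, which is our choice for illustration and not a method stated in the paper. The count of 14 bypasses is an assumption inferred from 14/19 ≈ 73.7%, and applying N=19 to the English register is likewise an assumption; the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Assumed counts: 14 of 19 bypasses (about 73.7%) and 0 of 19 for English.
nepali_low, nepali_high = wilson_interval(14, 19)
english_low, english_high = wilson_interval(0, 19)
print(f"Nepali register:  {nepali_low:.3f} to {nepali_high:.3f}")
print(f"English register: {english_low:.3f} to {english_high:.3f}")
```

Under these assumptions the 73.7% estimate is compatible with a true bypass rate anywhere from roughly 51% to 88%, and an observed 0% with a true rate of up to roughly 17%. The gap between English and Nepali registers remains clear, but the individual percentages are coarse, which is why they are described as indicative.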

Files

NASB_paper (1).pdf

Size: 182.3 kB
md5:466c0295ab5305ffb9c491e52cf01c5d
