Aegis: A Production Inference-Time Governance Engine for Large Language Models
Description
Embedding-based content classifiers deployed as LLM governance infrastructure exhibit
five systematic, reproducible failure modes that are not addressable through training data
expansion alone: (1) rank-weighted cluster bias in k-NN voting amplifies false positives
when harmful categories have larger training corpora; (2) categorical intent dampening
creates a life-critical safety bypass when applied uniformly across harm categories; (3) PII
policy inversion causes a 77-percentage-point recall failure through a disclosure-versus-exploitation
design flaw; (4) character-level obfuscation is a structural embedding-layer
attack not fixable by training; and (5) code-switching exposes a multilingual gap affecting
500M+ Hindi speakers. We document each failure mode with root-cause analysis and
a concrete architectural mitigation, and present Aegis, the production system built
from these findings: a model-agnostic, inference-time governance engine that intercepts
queries before LLM invocation, enforcing allow/block/support decisions across twelve harm
categories at sub-20 ms CPU latency. Aegis combines ONNX-accelerated sentence embeddings,
FAISS approximate nearest-neighbour retrieval over 2,416 labelled governance
examples, lightweight heuristic attack-vector detectors, and a deterministic policy engine
linked to eleven regulatory frameworks, including DPDP 2023, GDPR, the EU AI Act,
HIPAA, and SEBI. On a self-constructed 1,001-sample adversarial benchmark, Aegis
achieves 99.30% overall accuracy [95% CI: 98.70%–99.80%], 100.00% precision (zero
false positives), 99.20% recall [95% CI: 98.52%–99.77%], and F1=99.60%; these results
indicate strong internal consistency and require external validation on independently
constructed benchmarks. Against the OpenAI Moderation API on the same benchmark,
Aegis achieves +34.96pp higher accuracy (99.30% vs. 64.34%) and reduces false negatives
from 347 to 7 — driven primarily by six harm categories the OpenAI API does not
cover (PROMPT INJECTION, SYSTEM EXFILTRATION, FINANCIAL, LEGAL, PII,
MEDICAL). The training data (curated synthetic examples), evaluation benchmark (a
curated, synthetic, and fully anonymized 1,001-sample adversarial set), and governance
engine source code are available for research use upon request to the corresponding author.
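To illustrate failure mode (1), the sketch below shows how rank-weighted k-NN voting can be biased toward the category with the larger training corpus. This is a toy reconstruction, not the Aegis implementation: the vectors, the 1/(rank+1) weighting, and the `knn_vote` helper are all hypothetical, chosen only to make the cluster-size effect visible.

```python
# Hypothetical sketch of rank-weighted k-NN voting over toy embeddings,
# illustrating how a larger harmful-category corpus can dominate the vote.
# Not the Aegis implementation; all data and weights are illustrative.
from collections import defaultdict


def knn_vote(query, corpus, k=5):
    """corpus: list of (vector, label). Rank-weighted vote with weight 1/(rank+1)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Rank neighbours by similarity (dot product here, for simplicity).
    ranked = sorted(corpus, key=lambda item: -dot(query, item[0]))[:k]
    scores = defaultdict(float)
    for rank, (_, label) in enumerate(ranked):
        scores[label] += 1.0 / (rank + 1)
    return max(scores, key=scores.get)


# Toy corpus: the "harmful" cluster has four examples, "benign" has one.
# The larger cluster fills more top-k slots, so a borderline query is
# pushed toward a false positive.
corpus = [([1.0, 0.1], "harmful") for _ in range(4)] + [([0.5, 1.0], "benign")]
print(knn_vote([1.0, 0.5], corpus))  # -> harmful
```

The point of the sketch is that the bias is structural: no single benign example can outweigh four correlated harmful neighbours under rank weighting, regardless of how close it is to the query.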
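Failure mode (4), character-level obfuscation, is described as an embedding-layer attack that training cannot fix; a heuristic normalization pass of the kind the abstract calls an "attack-vector detector" could look like the sketch below. The `normalize` function, the leetspeak mapping, and the NFKD-folding step are assumptions for illustration, not Aegis's actual detector.

```python
# Hypothetical de-obfuscation heuristic: fold Unicode confusables via NFKD,
# strip combining marks, and map common leetspeak substitutions, so that
# obfuscated text re-enters the embedding model in canonical form.
# Illustrative only; not the Aegis implementation.
import unicodedata

# Assumed leetspeak table; a production mapping would be larger.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"})


def normalize(text: str) -> str:
    # NFKD maps compatibility characters (e.g. fullwidth letters) to ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks left over from decomposed accents.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower().translate(LEET)


print(normalize("h4ck th3 s1te"))  # -> "hack the site"
print(normalize("ｈａｃｋ"))        # fullwidth Unicode folds to -> "hack"
```

Because the substitution happens at the character level, the raw obfuscated string embeds far from its training-set neighbours; normalizing before embedding restores retrieval, which is why the abstract frames this as an architectural rather than a data-coverage fix.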
Files
final axirv submission.pdf (451.9 kB)
md5:32dae0609e2bef2a57e9da221fe1afb5