Published February 5, 2026 | Version v1
Other
Open
OnionGuard: Adaptive Layered Guardrails with Retrieval-Grounded Verification for LLM Jailbreak Defense
Description
This repository contains the research implementation of OnionGuard.
⚠️ Note for Reviewers: Data Availability & Reproducibility
This artifact is packaged to support double-blind review, safe handling of attack-oriented content, and practical, end-to-end reproducibility within a reasonable runtime.
Prebuilt KB vector stores (≤100MB each):
- Reproducible retrieval configuration: OnionGuard relies on KB-backed retrieval. The provided vector stores fix the KB schema and the retrieval/indexing setup that our pipeline expects (e.g., collections, metadata format, embedding setup, and retrieval parameters), so reviewers can run the identical end-to-end guard logic without rebuilding the KBs.
- Reduced setup time: Rebuilding embeddings and indexes can be time-consuming; the prebuilt KBs significantly reduce KB construction overhead for reproduction.
- Improved run-to-run stability: Rebuilding KBs can introduce small variations (e.g., numerical nondeterminism and implementation differences) that affect retrieval behavior. Shipping prebuilt KBs improves stability across runs for artifact evaluation.
- KB packaging note: We originally aimed to release the full KBs. However, uploading the full vector stores to Zenodo repeatedly failed with upload/packaging errors. We therefore provide size-capped Lite KB snapshots (≤100MB per KB) for reliable download and execution.
Evaluation subsets (300 samples per benchmark):
- Risk mitigation: Full releases may include large-scale automated attack prompts; we limit the released evaluation data to reduce potential misuse.
- Fast reproduction: The full evaluation can be computationally expensive. The 300-sample subsets allow reviewers to validate the end-to-end pipeline in a reasonable timeframe.
- Methodology-focused artifact: This release is designed to demonstrate functional reproducibility and comparative validation. Because it uses reduced KBs and evaluation subsets, result metrics may differ from the full-dataset results reported in the paper.
📋 Prerequisites
- Python: 3.10.18
- Environment manager: Anaconda (conda)
- Hardware: NVIDIA GPU + CUDA driver (required for vLLM inference)
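As a quick sanity check of the prerequisites above, the following is a minimal sketch rather than part of the artifact. The helper name `check_prerequisites` is hypothetical, it compares only the major.minor Python version (3.10), and it treats the presence of `nvidia-smi` on `PATH` as a rough proxy for an installed NVIDIA driver.

```python
import shutil
import sys


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # The README pins Python 3.10.18; this sketch only compares major.minor.
    if tuple(version_info[:2]) != (3, 10):
        problems.append(
            f"Python 3.10 expected, found {version_info[0]}.{version_info[1]}"
        )
    # Assumption: nvidia-smi ships with the NVIDIA driver, so its absence
    # from PATH suggests (but does not prove) that the driver is missing.
    if which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH (NVIDIA driver missing?)")
    return problems
```

Calling `check_prerequisites()` with no arguments inspects the running interpreter and the current `PATH`; the two parameters exist only so the check can be exercised without a GPU.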
🛠️ Installation
1. Create and Activate Environment
First, create a conda environment using the provided `environment.yml` file.
conda env create -f environment.yml
conda activate onion_guard
2. Install Package
Install the package in editable mode.
pip install -e .
conda develop .
🚀 Getting Started
To run OnionGuard, you need to start the vLLM server first, and then run the test scripts in a separate terminal.
1. Start the vLLM Server
Run the startup script to initialize the inference server.
chmod +x ./execute_vllm.sh
bash ./execute_vllm.sh
Note: Keep this terminal open while running the tests.
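Model loading can take a while, so it may help to wait until the server answers before launching the tests. The sketch below polls an HTTP health endpoint. It assumes `execute_vllm.sh` starts vLLM's OpenAI-compatible server, which exposes `/health`, and that it listens on `localhost:8000` (vLLM's default port); both are assumptions about the script, and the names `http_ok` and `wait_for_server` are hypothetical.

```python
import time
import urllib.request

# Assumption: default vLLM host/port; adjust to what execute_vllm.sh configures.
HEALTH_URL = "http://localhost:8000/health"


def http_ok(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib.error.URLError, refused connections, and timeouts
        # are all subclasses of OSError.
        return False


def wait_for_server(url=HEALTH_URL, max_wait=600.0, interval=5.0,
                    probe=http_ok, sleep=time.sleep):
    """Poll `url` until it responds or `max_wait` seconds were spent sleeping."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

In the second terminal, `wait_for_server()` returns `True` once the endpoint responds and `False` if it stays unreachable for roughly `max_wait` seconds.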
2. Run OnionGuard
Open a new terminal, activate the environment, and navigate to the configuration directory.
conda activate onion_guard
cd examples/configs/OnionGuard
You can evaluate OnionGuard using the following benchmark scripts.
Attack Defense Benchmark
Evaluate the defense performance against direct attacks.
python ONION_GUARD_ATTACK_TEST.py
Safety Dataset Benchmarks
Evaluate OnionGuard against various standard safety datasets.
python ONION_GUARD_BENCHMARK_TEST.py --dataset <DATASET_NAME>
Supported Datasets:
- AEGIS
- XSTEST
- OAI
- TOXIC
Examples:
# Run benchmark on AEGIS dataset
python ONION_GUARD_BENCHMARK_TEST.py --dataset AEGIS
# Run benchmark on XSTEST dataset
python ONION_GUARD_BENCHMARK_TEST.py --dataset XSTEST
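To run all four supported datasets in one go, a small driver along these lines could be used. It is a sketch, not a script shipped with the artifact: `build_command` and `run_all` are hypothetical helpers, and the code assumes it is started from `examples/configs/OnionGuard` with the `onion_guard` environment active and the vLLM server running.

```python
import subprocess
import sys

# Dataset names as listed under "Supported Datasets".
SUPPORTED_DATASETS = ("AEGIS", "XSTEST", "OAI", "TOXIC")
BENCHMARK_SCRIPT = "ONION_GUARD_BENCHMARK_TEST.py"


def build_command(dataset):
    """Build the argv list for one benchmark run, rejecting unknown names."""
    if dataset not in SUPPORTED_DATASETS:
        raise ValueError(f"unsupported dataset: {dataset!r}")
    # sys.executable keeps the run inside the currently active environment.
    return [sys.executable, BENCHMARK_SCRIPT, "--dataset", dataset]


def run_all(datasets=SUPPORTED_DATASETS, runner=subprocess.run):
    """Run each benchmark in turn and return {dataset: exit code}."""
    results = {}
    for dataset in datasets:
        # check=False: a failing benchmark should not abort the remaining runs.
        completed = runner(build_command(dataset), check=False)
        results[dataset] = completed.returncode
    return results
```

`run_all()` executes the benchmarks sequentially; a non-zero exit code in the returned mapping marks a dataset whose run failed.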
WildGuard Output Benchmark
Evaluate the output filtering capabilities using the WildGuard benchmark.
python ONION_GUARD_WILDGUARD_OUTPUT_TEST.py
📁 Key Paths (for reviewers)
- Core OnionGuard logic: nemoguardrails/library/onion_guard/
- Benchmark Configs & KBs: examples/configs/OnionGuard/
- OnionGuard System Prompts: examples/configs/OnionGuard/config/prompts.yml
❓ Troubleshooting
If you encounter any issues during reproduction, please check that:
- the vLLM server is running,
- the correct environment is activated, and
- you are executing scripts under examples/configs/OnionGuard/
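The environment and working-directory items in this checklist can be checked programmatically. The following sketch uses a hypothetical `diagnose` helper and relies on conda exporting the active environment name in `CONDA_DEFAULT_ENV`. The server check needs a network probe and is deliberately left out here.

```python
import os
from pathlib import Path

EXPECTED_ENV = "onion_guard"  # name used by `conda activate onion_guard`
EXPECTED_DIR = Path("examples") / "configs" / "OnionGuard"


def diagnose(environ=os.environ, cwd=None):
    """Return a list of likely setup problems; empty means both checks pass."""
    cwd = Path.cwd() if cwd is None else Path(cwd)
    issues = []
    # conda sets CONDA_DEFAULT_ENV to the name of the active environment.
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != EXPECTED_ENV:
        issues.append(f"active conda env is {active!r}, expected {EXPECTED_ENV!r}")
    # Compare the last three path components with examples/configs/OnionGuard.
    if cwd.parts[-3:] != EXPECTED_DIR.parts:
        issues.append(f"working directory {cwd} is not under {EXPECTED_DIR}")
    return issues
```

Running `diagnose()` from the terminal used for the tests prints nothing useful by itself; inspect or print the returned list, where each entry names one failed check.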
Files
| Name | Size |
|---|---|
| OnionGuard_260205.zip (md5:3e9275578a8f12dcb417d82f35ea3f5c) | 447.6 MB |
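Since the listing publishes an MD5 checksum for `OnionGuard_260205.zip`, the download can be verified before unpacking. This is a minimal sketch with hypothetical helper names (`md5_of`, `verify_archive`); it assumes the archive sits in the current directory under its original file name.

```python
import hashlib
from pathlib import Path

# Values copied from the file listing of this record.
ARCHIVE_NAME = "OnionGuard_260205.zip"
EXPECTED_MD5 = "3e9275578a8f12dcb417d82f35ea3f5c"


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path=ARCHIVE_NAME, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the published checksum."""
    return md5_of(path) == expected.lower()
```

`verify_archive()` returning `False` indicates a corrupted or incomplete download; MD5 is used here only because it is the checksum the record provides.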