Published February 5, 2026 | Version v1
Other Open

OnionGuard: Adaptive Layered Guardrails with Retrieval-Grounded Verification for LLM Jailbreak Defense

Authors/Creators

Description

This repository contains the research implementation of OnionGuard.

 

⚠️ Note for Reviewers: Data Availability & Reproducibility

This artifact is packaged to support double-blind review, safe handling of attack-oriented content, and practical, end-to-end reproducibility within a reasonable runtime.
 

Prebuilt KB vector stores (≤100MB each):

  • Reproducible retrieval configuration: OnionGuard relies on KB-backed retrieval. The provided vector stores fix the KB schema and the retrieval/indexing setup used by our pipeline (e.g., collections, metadata format, embedding setup, and retrieval parameters), so reviewers can run the identical end-to-end guard logic without rebuilding the KBs.
  • Reduced setup time: Rebuilding embeddings and indexes can be time-consuming; the prebuilt KBs significantly reduce KB construction overhead for reproduction.
  • Improved run-to-run stability: Rebuilding KBs can introduce small variations (e.g., numerical nondeterminism and implementation differences) that affect retrieval behavior. Shipping prebuilt KBs improves stability across runs for artifact evaluation.
  • KB packaging note: We originally aimed to release the full KBs. However, uploading the full vector-store files to Zenodo repeatedly failed with upload/packaging errors. We therefore provide size-capped Lite KB snapshots (≤100MB per KB) for reliable download and execution.

Evaluation subsets (300 samples per benchmark):

  • Risk mitigation: Full releases may include large-scale automated attack prompts; we limit the released evaluation data to reduce potential misuse.
  • Fast reproduction: The full evaluation can be computationally expensive. The 300-sample subsets allow reviewers to validate the end-to-end pipeline in a reasonable timeframe.
  • Methodology-focused artifact: This release is designed to demonstrate functional reproducibility and comparative validation. Because it uses 300-sample subsets and Lite KBs, the resulting metrics may differ from the full-dataset results reported in the paper.

 

📋 Prerequisites

  • Python: 3.10.18
  • Environment manager: Anaconda (conda)
  • Hardware: NVIDIA GPU + CUDA driver (required for vLLM inference)
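
The prerequisites above can be sanity-checked before installation. The following is a minimal sketch, not part of the artifact: the helper names `python_ok` and `nvidia_driver_visible` are hypothetical, and the presence of `nvidia-smi` on PATH is used only as a rough proxy for an installed NVIDIA driver.

```python
import shutil
import sys

# From the prerequisites list: Python 3.10.18 (only major.minor is checked here).
REQUIRED_PYTHON = (3, 10)


def python_ok(version=None):
    """Return True when the interpreter's major.minor version is 3.10."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) == REQUIRED_PYTHON


def nvidia_driver_visible():
    """Return True when nvidia-smi is on PATH (a proxy for an installed driver)."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    print("Python 3.10.x:", python_ok())
    print("nvidia-smi on PATH:", nvidia_driver_visible())
```

A `False` on the second check does not prove that no GPU is present; it only means the driver tooling was not found on PATH.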

 

🛠️ Installation

1. Create and Activate Environment

First, create a conda environment using the provided `environment.yml` file.
conda env create -f environment.yml
conda activate onion_guard
 

2. Install Package

Install the package in editable mode.
pip install -e .
conda develop .
 

🚀 Getting Started

To run OnionGuard, you need to start the vLLM server first, and then run the test scripts in a separate terminal.
 

1. Start the vLLM Server

Run the startup script to initialize the inference server.
chmod +x ./execute_vllm.sh
bash ./execute_vllm.sh
Note: Keep this terminal open while running the tests.
 

2. Run OnionGuard

Open a new terminal, activate the environment, and navigate to the configuration directory.
conda activate onion_guard
cd examples/configs/OnionGuard
You can evaluate OnionGuard using the following benchmark scripts.

 

Attack Defense Benchmark

Evaluate the defense performance against direct attacks.
python ONION_GUARD_ATTACK_TEST.py
 

Safety Dataset Benchmarks

Evaluate OnionGuard against various standard safety datasets.
python ONION_GUARD_BENCHMARK_TEST.py --dataset <DATASET_NAME>
 

Supported Datasets:

  • AEGIS
  • XSTEST
  • OAI
  • TOXIC
 

Examples:

# Run benchmark on AEGIS dataset
python ONION_GUARD_BENCHMARK_TEST.py --dataset AEGIS

 

# Run benchmark on XSTEST dataset
python ONION_GUARD_BENCHMARK_TEST.py --dataset XSTEST
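
To run all four supported datasets in one go, the per-dataset command can be wrapped in a small driver. This is a hedged sketch rather than a script shipped with the artifact: `build_commands` and `run_all` are hypothetical helpers, and the only interface assumed is the `--dataset` flag documented above. Run it from examples/configs/OnionGuard with the vLLM server already started.

```python
import subprocess
import sys
from pathlib import Path

SCRIPT = "ONION_GUARD_BENCHMARK_TEST.py"
DATASETS = ["AEGIS", "XSTEST", "OAI", "TOXIC"]  # the supported datasets listed above


def build_commands(datasets=DATASETS):
    """Build one command line per dataset, using the current interpreter."""
    return [[sys.executable, SCRIPT, "--dataset", name] for name in datasets]


def run_all(datasets=DATASETS):
    """Run each benchmark in turn and return {dataset: exit code}."""
    results = {}
    for name, cmd in zip(datasets, build_commands(datasets)):
        print("Running:", " ".join(cmd))
        results[name] = subprocess.run(cmd).returncode
    return results


if __name__ == "__main__":
    if Path(SCRIPT).is_file():
        print(run_all())
    else:
        print(f"{SCRIPT} not found; run this from examples/configs/OnionGuard")
```

The benchmarks run sequentially so that they do not compete for the single vLLM server. A nonzero exit code in the returned mapping marks a run that failed.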
 
 

WildGuard Output Benchmark

Evaluate the output filtering capabilities using the WildGuard benchmark.

python ONION_GUARD_WILDGUARD_OUTPUT_TEST.py

 

📁 Key Paths (for reviewers)

  • Core OnionGuard logic: nemoguardrails/library/onion_guard/
  • Benchmark Configs & KBs: examples/configs/OnionGuard/
  • OnionGuard System Prompts: examples/configs/OnionGuard/config/prompts.yml

 

❓ Troubleshooting

If you encounter any issues during reproduction, please check that:
  1. the vLLM server is running,
  2. the correct environment is activated, and
  3. you are executing the scripts from within examples/configs/OnionGuard/.
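
The three checks above can be automated with a short preflight script. This is a sketch under stated assumptions, not part of the artifact: it assumes the server started by execute_vllm.sh exposes vLLM's OpenAI-compatible API at the default address http://localhost:8000 (adjust `VLLM_MODELS_URL` if the script sets another host or port), and it relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation. The function names are hypothetical.

```python
import os
import urllib.error
import urllib.request
from pathlib import Path

# Assumption: default vLLM OpenAI-compatible endpoint; execute_vllm.sh may differ.
VLLM_MODELS_URL = "http://localhost:8000/v1/models"
EXPECTED_ENV = "onion_guard"
EXPECTED_CWD_SUFFIX = ("examples", "configs", "OnionGuard")


def server_running(url=VLLM_MODELS_URL, timeout=3.0):
    """Return True when the model-listing endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def env_active(environ=os.environ):
    """Return True when the onion_guard conda environment is activated."""
    return environ.get("CONDA_DEFAULT_ENV") == EXPECTED_ENV


def in_config_dir(cwd=None):
    """Return True when the working directory ends in examples/configs/OnionGuard."""
    parts = Path(cwd if cwd is not None else Path.cwd()).parts
    return parts[-3:] == EXPECTED_CWD_SUFFIX


if __name__ == "__main__":
    print("vLLM server reachable:", server_running())
    print("onion_guard env active:", env_active())
    print("in examples/configs/OnionGuard:", in_config_dir())
```

If the server requires an API key, the probe receives an HTTP error and reports `False` even though the server is up.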

Files

OnionGuard_260205.zip (447.6 MB)
md5:3e9275578a8f12dcb417d82f35ea3f5c
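
After downloading, the archive can be checked against the MD5 value listed for it in the Files section. The snippet below is a minimal sketch using only the Python standard library; `md5_of_file` is a hypothetical helper, and the archive is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path

ARCHIVE = "OnionGuard_260205.zip"
EXPECTED_MD5 = "3e9275578a8f12dcb417d82f35ea3f5c"  # value from the Files listing


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    if Path(ARCHIVE).is_file():
        actual = md5_of_file(ARCHIVE)
        print("checksum OK" if actual == EXPECTED_MD5 else f"mismatch: {actual}")
    else:
        print(f"{ARCHIVE} not found in the current directory")
```

Reading in chunks keeps memory use flat for the 447.6 MB archive. A mismatch usually indicates an incomplete download.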