OnionGuard: Adaptive Layered Guardrails with Retrieval-Grounded Verification for LLM Jailbreak Defense

Anonymous, Anonymous

doi:10.5281/zenodo.18494219

Published February 5, 2026 | Version v1

Other Open

This repository contains the research implementation of OnionGuard.

⚠️ Note for Reviewers: Data Availability & Reproducibility

This artifact is packaged to support double-blind review, safe handling of attack-oriented content, and practical, end-to-end reproducibility within a reasonable runtime.

Prebuilt KB vector stores (≤100MB each):

Reproducible retrieval configuration: OnionGuard relies on KB-backed retrieval. The provided vector stores ensure a consistent KB schema and a compatible retrieval/indexing setup with our pipeline (e.g., collections, metadata format, embedding setup, and retrieval parameters), enabling reviewers to run the identical end-to-end guard logic without rebuilding KBs.
Reduced setup time: Rebuilding embeddings and indexes can be time-consuming; the prebuilt KBs significantly reduce KB construction overhead for reproduction.
Improved run-to-run stability: Rebuilding KBs can introduce small variations (e.g., numerical nondeterminism and implementation differences) that affect retrieval behavior. Shipping prebuilt KBs improves stability across runs for artifact evaluation.
KB packaging note: We originally aimed to release the full KBs. However, uploads of the full vector stores to Zenodo repeatedly failed with upload/packaging errors. We therefore provide size-capped Lite KB snapshots (≤100MB per KB) for reliable download and execution.

Evaluation subsets (300 samples per benchmark):

Risk mitigation: Full releases may include large-scale automated attack prompts; we limit the released evaluation data to reduce potential misuse.
Fast reproduction: The full evaluation can be computationally expensive. The 300-sample subsets allow reviewers to validate the end-to-end pipeline in a reasonable timeframe.
Methodology-focused artifact: This release is designed to demonstrate functional reproducibility and comparative validation; metrics obtained on the 300-sample subsets may differ from the full-dataset results reported in the paper.

📋 Prerequisites

Python: 3.10.18
Environment manager: Anaconda (conda)
Hardware: NVIDIA GPU + CUDA driver (required for vLLM inference)
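
Before installing, it can help to confirm that the machine roughly matches the prerequisites listed above. The following is a minimal sketch, not part of the artifact: the helper name `check_prerequisites` is hypothetical, only the major.minor Python version is compared, and the presence of `nvidia-smi` on the PATH is used as a rough proxy for an installed NVIDIA driver.

```python
import shutil
import sys

# Version stated in the Prerequisites list; only major.minor is compared here.
EXPECTED_PYTHON = (3, 10)


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means none were found."""
    version_info = sys.version_info if version_info is None else version_info
    problems = []
    if tuple(version_info[:2]) != EXPECTED_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} found, expected 3.10.x"
        )
    # nvidia-smi ships with the NVIDIA driver, so its absence suggests no usable driver.
    if which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH (NVIDIA driver missing?)")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Running the file directly prints one warning per detected problem and nothing if both checks pass. It does not verify that the GPU has enough memory for the model served by vLLM.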

🛠️ Installation

1. Create and Activate Environment

First, create a conda environment using the provided `environment.yml` file.

conda env create -f environment.yml

conda activate onion_guard

2. Install Package

Install the package in editable mode.

pip install -e .
conda develop .

🚀 Getting Started

To run OnionGuard, start the vLLM server first, then run the test scripts in a separate terminal.

1. Start the vLLM Server

Run the startup script to initialize the inference server.

chmod +x ./execute_vllm.sh

bash ./execute_vllm.sh

Note: Keep this terminal open while running the tests.
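
Model loading can take a while, so it may be useful to confirm the server is answering before launching a benchmark. The sketch below assumes that `execute_vllm.sh` starts vLLM's OpenAI-compatible server on its default address (`localhost:8000`); the script may configure a different host or port, in which case the URL needs adjusting. The helper name `server_is_up` is hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: execute_vllm.sh starts vLLM's OpenAI-compatible server on the
# default host and port. Adjust the URL if the script sets a different --port.
DEFAULT_URL = "http://localhost:8000/v1/models"


def server_is_up(url=DEFAULT_URL, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx HTTP status.
        return False


if __name__ == "__main__":
    print("vLLM server reachable:", server_is_up())
```

A `False` result right after startup usually just means the model is still loading; retrying after a short wait is reasonable.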

2. Run OnionGuard

Open a new terminal, activate the environment, and navigate to the configuration directory.

conda activate onion_guard
cd examples/configs/OnionGuard

You can evaluate OnionGuard using the following benchmark scripts.

Attack Defense Benchmark

Evaluate the defense performance against direct attacks.

python ONION_GUARD_ATTACK_TEST.py

Safety Dataset Benchmarks

Evaluate OnionGuard against various standard safety datasets.

python ONION_GUARD_BENCHMARK_TEST.py --dataset <DATASET_NAME>

Supported Datasets:

AEGIS
XSTEST
OAI
TOXIC

Examples:

# Run benchmark on AEGIS dataset

python ONION_GUARD_BENCHMARK_TEST.py --dataset AEGIS

# Run benchmark on XSTEST dataset

python ONION_GUARD_BENCHMARK_TEST.py --dataset XSTEST
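
To run all four supported datasets in sequence, the documented invocation can be wrapped in a small driver. This is a sketch under stated assumptions: the helper names `build_commands` and `run_all` are hypothetical, the script name and `--dataset` values are taken from this README, and the driver is assumed to be started from `examples/configs/OnionGuard` with the vLLM server already running.

```python
import subprocess
import sys

# Dataset names copied from the "Supported Datasets" list.
DATASETS = ["AEGIS", "XSTEST", "OAI", "TOXIC"]
SCRIPT = "ONION_GUARD_BENCHMARK_TEST.py"


def build_commands(datasets=DATASETS, python=sys.executable):
    """Build one command line per dataset, mirroring the documented invocation."""
    return [[python, SCRIPT, "--dataset", name] for name in datasets]


def run_all(datasets=DATASETS):
    """Run the benchmarks one after another and return {dataset: exit code}."""
    results = {}
    for name, command in zip(datasets, build_commands(datasets)):
        print("Running:", " ".join(command))
        results[name] = subprocess.run(command).returncode
    return results
```

Calling `run_all()` returns a dictionary of exit codes, so a nonzero entry points to the dataset whose run failed. The runs are sequential on purpose, since they share a single vLLM server.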

WildGuard Output Benchmark

Evaluate the output filtering capabilities using the WildGuard benchmark.

python ONION_GUARD_WILDGUARD_OUTPUT_TEST.py

📁 Key Paths (for reviewers)

Core OnionGuard logic: nemoguardrails/library/onion_guard/
Benchmark Configs & KBs: examples/configs/OnionGuard/
OnionGuard System Prompts: examples/configs/OnionGuard/config/prompts.yml

❓ Troubleshooting

If you encounter any issues during reproduction, please check that:

the vLLM server is running,
the correct environment is activated, and
you are executing scripts under examples/configs/OnionGuard/
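
Two of the checks above can be automated. The following sketch assumes the environment was activated with conda, which exports the active environment name as `CONDA_DEFAULT_ENV`; the helper name `diagnose` is hypothetical and the script is not part of the artifact.

```python
import os
from pathlib import Path

EXPECTED_ENV = "onion_guard"  # name used with `conda activate`
EXPECTED_DIR = ("examples", "configs", "OnionGuard")  # where the scripts are run from


def diagnose(cwd=None, environ=None):
    """Return a list of likely setup problems; empty if none were detected."""
    cwd = Path.cwd() if cwd is None else Path(cwd)
    environ = os.environ if environ is None else environ
    problems = []
    # conda exports CONDA_DEFAULT_ENV with the name of the active environment.
    if environ.get("CONDA_DEFAULT_ENV") != EXPECTED_ENV:
        problems.append(f"conda environment '{EXPECTED_ENV}' is not active")
    # Compare the last three path components with the expected directory.
    if cwd.parts[-3:] != EXPECTED_DIR:
        problems.append("current directory is not examples/configs/OnionGuard")
    return problems


if __name__ == "__main__":
    for problem in diagnose():
        print("CHECK:", problem)
```

This does not cover the first item in the list; whether the vLLM server is running has to be checked separately, for example by querying its HTTP endpoint.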

Files

OnionGuard_260205.zip (447.6 MB)
md5:3e9275578a8f12dcb417d82f35ea3f5c
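
Since the archive is large, a truncated download is a plausible failure mode. The sketch below compares a local copy against the MD5 checksum listed for the archive; the helper names `md5_of` and `archive_is_intact` are hypothetical, and the file is assumed to sit in the current directory under its original name.

```python
import hashlib

# Checksum and file name as listed for the archive on this record.
EXPECTED_MD5 = "3e9275578a8f12dcb417d82f35ea3f5c"
ARCHIVE = "OnionGuard_260205.zip"


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path=ARCHIVE, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of(path) == expected
```

Calling `archive_is_intact()` after the download returns `False` if the file differs from the published one. MD5 is adequate for detecting transfer corruption, though it is not a defense against deliberate tampering.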
