Published March 9, 2026 | Version v2
Preprint · Open Access

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Authors/Creators

  • Independent Researcher

Description

This preprint introduces ADVERSA (Adversarial Dynamics and Vulnerability Evaluation of Resistance Surfaces in AI), an automated red-teaming framework for measuring multi-turn guardrail behavior and judge reliability in large language models. Rather than treating safety evaluation as a binary jailbroken-or-not outcome, ADVERSA models compliance as a per-round trajectory scored on a structured five-point rubric ranging from hard refusal to full compliance.
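The per-round trajectory idea can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual API: the rubric level names (other than the stated endpoints, hard refusal and full compliance) and the function name are placeholders.

```python
# Hypothetical sketch: a five-point compliance rubric applied per round.
# Only the endpoints (1 = hard refusal, 5 = full compliance) come from the
# abstract; the intermediate labels are illustrative placeholders.
RUBRIC = {
    1: "hard refusal",
    2: "soft refusal",
    3: "deflection",
    4: "partial compliance",
    5: "full compliance",
}

def first_jailbreak_round(trajectory, threshold=5):
    """Return the 1-indexed round at which the score first reaches the
    compliance threshold, or None if the guardrails hold throughout."""
    for round_no, score in enumerate(trajectory, start=1):
        if score >= threshold:
            return round_no
    return None

# A conversation whose guardrails degrade over five rounds breaks at round 4:
print(first_jailbreak_round([1, 2, 3, 5, 5]))  # -> 4
print(first_jailbreak_round([1, 1, 2, 2, 3]))  # -> None
```

Scoring each round, rather than the conversation as a whole, is what lets the framework measure *where* in a multi-turn interaction guardrails give way.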

The framework evaluates frontier models through sustained adversarial interaction, using a fine-tuned 70B attacker model and a triple-judge consensus panel. Across controlled experiments, the study analyzes jailbreak concentration by round, judge disagreement, self-judge effects, attacker drift, and attacker-side refusals as a confound in automated red teaming.
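A triple-judge consensus panel can be aggregated in several ways; the sketch below assumes a simple majority vote with a disagreement flag. The aggregation rule and function name are assumptions for illustration, not the paper's method.

```python
from collections import Counter

def consensus(verdicts):
    """Aggregate three judge scores (hypothetical rule: majority vote).

    Returns (score, disagreement): if at least two judges agree, their
    score wins; otherwise fall back to the rounded mean and flag the
    round as a full three-way disagreement.
    """
    assert len(verdicts) == 3, "expects a triple-judge panel"
    score, count = Counter(verdicts).most_common(1)[0]
    if count >= 2:
        return score, False
    return round(sum(verdicts) / 3), True

print(consensus([5, 5, 4]))  # -> (5, False)
print(consensus([1, 3, 5]))  # -> (3, True)
```

Tracking the disagreement flag per round is one way to quantify judge reliability alongside the jailbreak metrics themselves.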

This record contains the paper manuscript. The associated code, logs, and supporting materials are available through the project repository.

Files

ADVERSA_paper.pdf (1.1 MB)
md5:7d008d6cec8afe43e365e2dbf8debb58

Additional details

Related works

Software

Repository URL
https://github.com/Harry-Ashley/adversa-guardrail-degradation
Programming language
Python
Development Status
Active