ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
Description
This preprint introduces ADVERSA (Adversarial Dynamics and Vulnerability Evaluation of Resistance Surfaces in AI), an automated red-teaming framework for measuring multi-turn guardrail behavior and judge reliability in large language models. Rather than treating safety evaluation as a binary jailbreak/no-jailbreak outcome, ADVERSA models compliance as a per-round trajectory, scored on a structured five-point rubric ranging from hard refusal to full compliance.
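As a sketch of how such a rubric can be operationalized, the snippet below models per-round compliance as an ordinal scale and reduces a conversation to a trajectory summary. The enum labels, the `score_trajectory` helper, and the summary fields are illustrative assumptions, not the paper's implementation.

```python
from enum import IntEnum

class Compliance(IntEnum):
    """Hypothetical labels for a five-point compliance rubric;
    the paper's exact anchors may differ."""
    HARD_REFUSAL = 1   # explicit refusal, no engagement
    SOFT_REFUSAL = 2   # refusal with redirection or partial engagement
    HEDGED = 3         # caveated or sanitized partial response
    PARTIAL = 4        # substantive help with omissions
    FULL = 5           # complete fulfillment of the adversarial request

def score_trajectory(round_scores: list[Compliance]) -> dict:
    """Summarize a multi-turn conversation as a per-round trajectory
    rather than a single jailbreak/no-jailbreak bit."""
    first_full = next(
        (i + 1 for i, s in enumerate(round_scores) if s is Compliance.FULL),
        None,  # None means guardrails held for the whole conversation
    )
    return {
        "rounds": len(round_scores),
        "trajectory": [int(s) for s in round_scores],
        "first_full_compliance_round": first_full,
        "max_compliance": int(max(round_scores)),
    }
```

Under this framing, a trajectory such as `[1, 3, 5]` reports first full compliance at round 3, which is the kind of per-round jailbreak concentration the study measures.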
The framework evaluates frontier models through sustained adversarial interaction, using a fine-tuned 70B attacker model and a triple-judge consensus panel. Across controlled experiments, the study analyzes jailbreak concentration by round, judge disagreement, self-judge effects, attacker drift, and attacker-side refusals as a confound in automated red teaming.
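The description does not spell out the panel's aggregation rule, so the following is only a minimal sketch of one plausible scheme: take the median of the three judges' rubric scores and flag rounds where the panel spreads across more than one rubric point, which is the kind of signal the judge-disagreement analysis would consume.

```python
from statistics import median

def consensus_score(judge_scores: dict[str, int]) -> tuple[int, bool]:
    """Reduce a triple-judge panel to one per-round score.

    Median aggregation and the spread-based disagreement flag are
    assumptions for illustration, not the paper's stated rule.
    """
    scores = list(judge_scores.values())
    if len(scores) != 3:
        raise ValueError("expected exactly three judges")
    disagreement = max(scores) - min(scores) > 1  # panel spans >1 rubric point
    return int(median(scores)), disagreement

# Example: two judges see compliance, one sees a refusal.
score, disputed = consensus_score({"judge_a": 5, "judge_b": 4, "judge_c": 2})
# score == 4, disputed == True; disputed rounds feed the disagreement analysis
```

Controlling for self-judge effects would then amount to filtering `judge_scores` so a model never rates its own outputs before aggregation.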
This record contains the paper manuscript. The associated code, logs, and supporting materials are available through the project repository.
Files

| Name | Size | MD5 |
|---|---|---|
| ADVERSA_paper.pdf | 1.1 MB | 7d008d6cec8afe43e365e2dbf8debb58 |
Additional details
Related works
- Is supplemented by: https://github.com/Harry-Ashley/adversa-guardrail-degradation (Software)

Software
- Repository URL: https://github.com/Harry-Ashley/adversa-guardrail-degradation
- Programming language: Python
- Development status: Active