ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
Description
This preprint introduces ADVERSA (Adversarial Dynamics and Vulnerability Evaluation of Resistance Surfaces in AI), an automated red-teaming framework for measuring multi-turn guardrail behavior and judge reliability in large language models. Rather than treating safety evaluation as a binary jailbreak/no-jailbreak outcome, ADVERSA models compliance as a per-round trajectory, scored on a structured five-point rubric ranging from hard refusal to full compliance.
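As a sketch of how such a rubric can be operationalized, the snippet below models per-round compliance as an ordinal scale and reduces a conversation to a trajectory summary. The enum labels, the `score_trajectory` helper, and the summary fields are illustrative assumptions, not the paper's implementation.

```python
from enum import IntEnum

class Compliance(IntEnum):
    """Hypothetical labels for a five-point compliance rubric;
    the paper's exact anchors may differ."""
    HARD_REFUSAL = 1   # explicit refusal, no engagement
    SOFT_REFUSAL = 2   # refusal with redirection or partial engagement
    HEDGED = 3         # caveated or sanitized partial response
    PARTIAL = 4        # substantive help with omissions
    FULL = 5           # complete fulfillment of the adversarial request

def score_trajectory(round_scores: list[Compliance]) -> dict:
    """Summarize a multi-turn conversation as a per-round trajectory
    rather than a single jailbreak/no-jailbreak bit."""
    first_full = next(
        (i + 1 for i, s in enumerate(round_scores) if s is Compliance.FULL),
        None,  # None means guardrails held for the whole conversation
    )
    return {
        "rounds": len(round_scores),
        "trajectory": [int(s) for s in round_scores],
        "first_full_compliance_round": first_full,
        "max_compliance": int(max(round_scores)),
    }
```

Under this framing, a trajectory such as `[1, 3, 5]` reports first full compliance at round 3, which is the kind of per-round jailbreak concentration the study measures.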
The framework evaluates frontier models through sustained adversarial interaction, using a fine-tuned 70B attacker model and a triple-judge consensus panel. Across controlled experiments, the study analyzes jailbreak concentration by round, judge disagreement, self-judge effects, attacker drift, and attacker-side refusals as a confound in automated red teaming.
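The description does not spell out the panel's aggregation rule, so the following is only a minimal sketch of one plausible scheme: take the median of the three judges' rubric scores and flag rounds where the panel spreads across more than one rubric point, which is the kind of signal the judge-disagreement analysis would consume.

```python
from statistics import median

def consensus_score(judge_scores: dict[str, int]) -> tuple[int, bool]:
    """Reduce a triple-judge panel to one per-round score.

    Median aggregation and the spread-based disagreement flag are
    assumptions for illustration, not the paper's stated rule.
    """
    scores = list(judge_scores.values())
    if len(scores) != 3:
        raise ValueError("expected exactly three judges")
    disagreement = max(scores) - min(scores) > 1  # panel spans >1 rubric point
    return int(median(scores)), disagreement

# Example: two judges see compliance, one sees a refusal.
score, disputed = consensus_score({"judge_a": 5, "judge_b": 4, "judge_c": 2})
# score == 4, disputed == True; disputed rounds feed the disagreement analysis
```

Controlling for self-judge effects would then amount to filtering `judge_scores` so a model never rates its own outputs before aggregation.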
This record contains the paper manuscript. The associated code, logs, and supporting materials are available through the project repository.
Files

| Name | Size | MD5 |
|---|---|---|
| ADVERSA_paper.pdf | 1.1 MB | 7d008d6cec8afe43e365e2dbf8debb58 |
Additional details
Related works
- Is supplemented by: https://github.com/Harry-Ashley/adversa-guardrail-degradation (Software)

Software
- Repository URL: https://github.com/Harry-Ashley/adversa-guardrail-degradation
- Programming language: Python
- Development status: Active