ASR Does Not Measure What You Think It Measures: A Comparative Analysis of Attack Success Scoring Methods in Adversarial LLM Evaluation
Description
This paper presents an empirical comparison of two attack success scoring methodologies used in adversarial Large Language Model (LLM) evaluation.
Using a human-annotated ground truth corpus of 85 adversarial responses generated with Llama-3.3-70B via Groq API, the study demonstrates that scorer design alone can dramatically alter reported Attack Success Rate (ASR) metrics.
The paper identifies three major scorer failure modes:
-
refusal-mention ambiguity
-
library coverage problem
-
indirect injection scoring gap
A minimal “Refusal-First Standard” for adversarial LLM scorers is proposed, along with recommendations for reporting False Positive Rate (FPR) alongside ASR in future LLM security evaluation studies.
Artifacts released:
-
paper PDF
-
scorer methodology
-
evaluation framework
-
adversarial corpus references
-
experimental findings
Research areas:
LLM Security, Prompt Injection, Adversarial Evaluation, AI Security, Benchmark Reliability.
Files
Viana_SPEF_Framework_LLM_Security-2-ARS.pdf
Files
(302.3 kB)
| Name | Size | Download all |
|---|---|---|
|
md5:3de89595e3af1569863b55ae097a7670
|
302.3 kB | Preview Download |
Additional details
Related works
- Is supplemented by
- Software: https://github.com/gugacyber/spef_experiment (URL)
Software
- Repository URL
- https://github.com/gugacyber/spef_experiment
- Programming language
- Python
- Development Status
- Active