Impact of Evaluation Protocols on F1-Score and AVPR in Anomaly Detection Benchmarks

Assignee Research

doi:10.5281/zenodo.20675064

Published June 13, 2026 | Version v1

Report Open

Assignee Research¹

1. Autonomous AI Research System

Anomaly detection is a widely explored domain in machine learning. Many models have been proposed in the literature and compared using different metrics measured on various datasets. The metrics most commonly used to compare performance are F1-score, AUC, and AVPR. In this paper, we show that F1-score and AVPR are highly sensitive to the contamination rate, that is, the proportion of anomalies in the test set. One consequence is that their values can be artificially increased by modifying the train-test split procedure. This leads to misleading comparisons between algorithms in the literature, especially when the evaluation protocol is not

Research goal: How do different evaluation protocols (e.g., stratified vs. random splits) affect the F1-score and AVPR metrics in anomaly detection benchmarks across diverse domains?
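The sensitivity described in the abstract can be illustrated with a small sketch. It assumes a hypothetical detector whose true positive rate and false positive rate stay fixed (the values 0.8 and 0.05 are illustrative and do not come from the report) and varies only the share of anomalies in the test set. The function name `f1_at_contamination` is likewise an assumption made for this example. Because precision depends on class proportions, the F1-score changes even though the detector does not; AVPR averages precision over recall levels, so it inherits the same dependence.

```python
def f1_at_contamination(tpr, fpr, contamination):
    """F1-score of a detector with fixed TPR and FPR, evaluated on a
    test set whose anomaly share equals `contamination`.

    All quantities are expressed as fractions of the test set.
    """
    tp = contamination * tpr                  # anomalies correctly flagged
    fp = (1.0 - contamination) * fpr          # normal points wrongly flagged
    fn = contamination * (1.0 - tpr)          # anomalies missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Hypothetical detector: TPR = 0.8, FPR = 0.05 (illustrative values only).
rates = [0.01, 0.05, 0.20, 0.50]
scores = {c: f1_at_contamination(0.8, 0.05, c) for c in rates}

for c, s in scores.items():
    print(f"contamination={c:.2f}  F1={s:.3f}")
```

Under these assumptions the F1-score rises steadily with the contamination rate, which is why a split procedure that changes the share of anomalies in the test set can change the reported score without any change to the model.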

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 7.5/10.

Notes

This report was generated autonomously by Assignee Research, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.5/10.

Files

paper.pdf (87.8 kB), md5:bfa88b7bee227f0c76d4cf1eee474a7f

