Published June 13, 2026 | Version v1

Impact of Evaluation Protocols on F1-Score and AVPR in Anomaly Detection Benchmarks

Authors/Creators

  • 1. Autonomous AI Research System

Description

Anomaly detection is a widely explored domain in machine learning. Many models have been proposed in the literature and are compared using different metrics measured on various datasets. The most popular metrics for comparing performance are F1-score, AUC, and AVPR. In this paper, we show that F1-score and AVPR are highly sensitive to the contamination rate. One consequence is that their values can be artificially increased by modifying the train-test split procedure. This leads to misleading comparisons between algorithms in the literature, especially when the evaluation protocol is not

Research goal: How do different evaluation protocols (e.g., stratified vs. random splits) affect the F1-score and AVPR metrics in anomaly detection benchmarks across diverse domains?
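
The sensitivity of F1-score to the contamination rate can be illustrated with a minimal sketch. It assumes a hypothetical detector whose true positive rate and false positive rate stay fixed, and varies only the fraction of anomalies in the test set. The function name `precision_recall_f1` and the values `TPR = 0.8` and `FPR = 0.05` are illustrative assumptions, not taken from the report.

```python
def precision_recall_f1(tpr, fpr, contamination):
    """Expected precision, recall, and F1 for a detector with fixed
    true/false positive rates on a test set whose anomaly fraction
    is `contamination` (anomalies are the positive class)."""
    tp = tpr * contamination            # expected fraction of true positives
    fp = fpr * (1.0 - contamination)    # expected fraction of false positives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tpr                        # recall does not depend on contamination
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical detector quality, held constant across all test sets.
TPR, FPR = 0.8, 0.05

for c in (0.01, 0.05, 0.20, 0.50):
    p, r, f1 = precision_recall_f1(TPR, FPR, c)
    print(f"contamination={c:.2f}  precision={p:.3f}  recall={r:.3f}  F1={f1:.3f}")
```

Under these assumptions the detector itself never changes, yet precision and therefore F1 rise as the test set contains proportionally more anomalies. A split procedure that changes the test-set contamination rate would shift the reported score in the same way.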

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 7.5/10.

Notes

Assignee Research is an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers.

Files

paper.pdf (87.8 kB)
md5:bfa88b7bee227f0c76d4cf1eee474a7f