Impact of Evaluation Protocols on F1-Score and AVPR in Anomaly Detection Benchmarks
Description
Anomaly detection is a widely explored domain in machine learning. Many models have been proposed in the literature and compared using different metrics measured on various datasets. The most popular metrics for comparing performance are the F1-score, AUC, and AVPR. In this paper, we show that the F1-score and AVPR are highly sensitive to the contamination rate. One consequence is that their values can be artificially increased by modifying the train-test split procedure. This leads to misleading comparisons between algorithms in the literature, especially when the evaluation protocol is not
Research goal: How do different evaluation protocols (e.g., stratified vs. random splits) affect the F1-score and AVPR metrics in anomaly detection benchmarks across diverse domains?
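The sensitivity to the contamination rate described above can be illustrated with a minimal sketch. It is not the paper's experimental protocol. It assumes a hypothetical detector whose true-positive and false-positive rates are fixed (0.8 and 0.05 are made-up values) and whose anomaly scores are drawn from two Gaussians chosen purely for illustration. All function names are hypothetical. Only the share of anomalies in the test set changes, which is what a different train-test split procedure would alter.

```python
import random


def f1_at_contamination(tpr, fpr, contamination):
    """F1 of a fixed detector on a test set with the given anomaly fraction.

    tpr and fpr are assumed constant: the detector itself does not change,
    only the class balance of the test set does.
    """
    tp = tpr * contamination
    fp = fpr * (1.0 - contamination)
    fn = (1.0 - tpr) * contamination
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each anomaly."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def simulate_avpr(n_normal, n_anomaly, seed=0):
    """AVPR for synthetic scores (illustrative assumption, not real data).

    Normal samples score ~ N(0, 1), anomalies ~ N(2, 1), whatever the ratio.
    """
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 1.0) for _ in range(n_normal)]
    scores += [rng.gauss(2.0, 1.0) for _ in range(n_anomaly)]
    labels = [0] * n_normal + [1] * n_anomaly
    return average_precision(labels, scores)


if __name__ == "__main__":
    for c in (0.01, 0.05, 0.20, 0.50):
        n_anomaly = 200
        n_normal = round(n_anomaly * (1.0 - c) / c)
        f1 = f1_at_contamination(tpr=0.8, fpr=0.05, contamination=c)
        avpr = simulate_avpr(n_normal, n_anomaly)
        print(f"contamination={c:.2f}  F1={f1:.3f}  AVPR={avpr:.3f}")
```

In this sketch both metrics rise as the test set contains a larger fraction of anomalies, even though the detector and its score distributions are unchanged. This is why results obtained under different split procedures may not be directly comparable.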
Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 7.5/10.
Notes
Files
paper.pdf (87.8 kB)
md5:bfa88b7bee227f0c76d4cf1eee474a7f