There is a newer version of the record available.

Published April 3, 2026 | Version 1.0

Which Leakage Types Matter?

Authors/Creators

Description

This record reports twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, that measure the severity of four classes of data leakage in machine learning. Class I (estimation, e.g. fitting scalers on the full data) is negligible: all nine conditions produce |ΔAUC| ≤ 0.005. Class II (selection, e.g. peeking and seed cherry-picking) is substantial: ~90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity, from d_z = 0.37 (Naive Bayes) to d_z = 1.11 (Decision Tree). Class IV (boundary) is invisible under random CV. The textbook emphasis is therefore inverted: normalization leakage matters least, while selection leakage at practical dataset sizes matters most.
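For readers unfamiliar with the d_z statistic quoted above, the following is a minimal sketch of a paired (within-subject) effect size, computed as the mean of the per-dataset score differences divided by their sample standard deviation. The function name `cohens_dz` and the AUC values are hypothetical and chosen only for illustration; they are not taken from the paper or from its accompanying software, and the paper's exact estimator may differ.

```python
from statistics import mean, stdev


def cohens_dz(leaky, clean):
    """Paired effect size: mean of the differences over their sample SD."""
    if len(leaky) != len(clean):
        raise ValueError("leaky and clean must be paired, one score per dataset")
    diffs = [a - b for a, b in zip(leaky, clean)]
    if len(diffs) < 2:
        raise ValueError("need at least two paired observations")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; d_z is undefined")
    return mean(diffs) / sd


# Hypothetical AUCs for five datasets, each scored once with the leaky
# procedure and once with the clean one (illustrative values only).
leaky_auc = [0.91, 0.88, 0.95, 0.90, 0.93]
clean_auc = [0.89, 0.87, 0.90, 0.89, 0.90]

d_z = cohens_dz(leaky_auc, clean_auc)
print(f"d_z = {d_z:.2f}")
```

Because each dataset serves as its own control, d_z standardizes by the spread of the differences rather than by the spread of the raw scores, so a small but consistent inflation can still yield a large effect size.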

Files

roth2026_landscape_leakage_types_v1.pdf (446.6 kB)
md5:b32a7ac8aeeff7592f2e19f5f7a5e1b8

Additional details

Additional titles

Subtitle (English)
A Quantitative Landscape Across 2,047 Benchmark Datasets

Related works

Is supplement to
Preprint: arXiv:2603.10742 (arXiv)
Preprint: 10.5281/zenodo.19406355 (DOI)
Is supplemented by
Software: https://github.com/epagogy/ml (URL)