Which Leakage Types Matter?
Authors/Creators
Description
Twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four classes of data leakage in machine learning. Class I (estimation: fitting scalers on full data) is negligible: all nine conditions produce |ΔAUC| ≤ 0.005. Class II (selection: peeking, seed cherry-picking) is substantial: ~90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity, from d_z = 0.37 (Naive Bayes) to d_z = 1.11 (Decision Tree). Class IV (boundary) is invisible under random cross-validation (CV). The textbook emphasis is therefore inverted: normalization leakage matters least, while selection leakage at practical dataset sizes matters most.
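The d_z values quoted above are paired effect sizes. As a minimal sketch, assuming the conventional definition of Cohen's d_z (mean of the paired differences divided by their sample standard deviation), the statistic for a within-subject comparison could be computed as follows. The function name `cohens_dz` and the example scores are hypothetical and are not taken from the paper or its software.

```python
from statistics import mean, stdev


def cohens_dz(leaky_scores, clean_scores):
    """Paired effect size: mean of per-dataset differences over their sample SD.

    Assumes the conventional d_z definition; the paper's exact estimator
    is not stated in this record.
    """
    if len(leaky_scores) != len(clean_scores):
        raise ValueError("scores must be paired, one leaky and one clean per dataset")
    if len(leaky_scores) < 2:
        raise ValueError("need at least two paired observations")

    # One difference per dataset: score with leakage minus score without it.
    diffs = [leaky - clean for leaky, clean in zip(leaky_scores, clean_scores)]

    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("differences are constant; d_z is undefined")
    return mean(diffs) / spread


# Hypothetical AUCs for four datasets, scored with and without leakage.
leaky = [0.90, 0.85, 0.80, 0.95]
clean = [0.88, 0.80, 0.79, 0.90]
print(f"d_z = {cohens_dz(leaky, clean):.2f}")
```

Because each dataset serves as its own control, d_z standardizes by the spread of the differences rather than by the spread of the raw scores, which is what makes it suited to a within-subject design.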
Files
roth2026_landscape_leakage_types_v1.pdf
| Name | Size |
|---|---|
| roth2026_landscape_leakage_types_v1.pdf (md5:b32a7ac8aeeff7592f2e19f5f7a5e1b8) | 446.6 kB |
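The md5 checksum listed for the file can be used to confirm that a downloaded copy is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_record` are hypothetical, and the local path passed to them is whatever location the PDF was saved to.

```python
import hashlib

# Checksum as listed in this record's file table.
EXPECTED_MD5 = "b32a7ac8aeeff7592f2e19f5f7a5e1b8"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file at `path` has the checksum listed in the record."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_record("roth2026_landscape_leakage_types_v1.pdf")` would return `True` only if the local copy is byte-for-byte identical to the listed file. MD5 is adequate for detecting accidental corruption but not deliberate tampering.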
Additional details
Additional titles
- Subtitle (English)
- A Quantitative Landscape Across 2,047 Benchmark Datasets
Related works
- Is supplement to
- Preprint: arXiv:2603.10742 (arXiv)
- Preprint: 10.5281/zenodo.19406355 (DOI)
- Is supplemented by
- Software: https://github.com/epagogy/ml (URL)