Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets

Roth, Simon

doi:10.5281/zenodo.20451150

Published May 29, 2026 | Version 1.1

Preprint Open

Twenty-eight within-subject counterfactual experiments across 2,047 iid tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on full data) is negligible: all nine conditions produce |ΔAUC| ≤ 0.005. Class II (selection: peeking, seed cherry-picking) is substantial: the measured effect is consistent with about 90% noise exploitation inflating reported scores. Class III (memorization) scales with model capacity: d_z = 0.37 (Naive Bayes) to 1.11 (Decision Tree) at 10% duplication. Class IV (boundary) is invisible under random cross-validation. Within this iid tabular regime, the textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
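The two effect measures quoted above, ΔAUC and d_z, can be illustrated with a minimal sketch. It assumes that d_z follows the standard within-subject definition (mean of the paired differences divided by their standard deviation) and that each dataset contributes one leaky and one clean AUC. The function name `paired_effect` and the scores are hypothetical and are not taken from the paper or its accompanying software.

```python
from statistics import mean, stdev


def paired_effect(leaky_auc, clean_auc):
    """Return (mean delta AUC, d_z) for paired per-dataset scores.

    Assumption: d_z = mean(differences) / sample SD(differences),
    the usual standardized effect size for within-subject designs.
    """
    if len(leaky_auc) != len(clean_auc):
        raise ValueError("score lists must be paired, one entry per dataset")
    # One difference per dataset: leaky condition minus clean condition.
    diffs = [leaky - clean for leaky, clean in zip(leaky_auc, clean_auc)]
    if len(diffs) < 2:
        raise ValueError("need at least two paired datasets")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("d_z is undefined when all differences are identical")
    avg = mean(diffs)
    return avg, avg / sd


# Hypothetical AUCs for five datasets (illustrative values only).
leaky = [0.81, 0.77, 0.90, 0.68, 0.85]
clean = [0.79, 0.76, 0.86, 0.67, 0.80]

delta_auc, d_z = paired_effect(leaky, clean)
print(f"mean delta AUC = {delta_auc:.3f}, d_z = {d_z:.2f}")
```

Because d_z divides by the spread of the differences, a small but very consistent ΔAUC can yield a large d_z, which is why the two numbers are best read together.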

Files

Name	Size	Checksum
roth2026_landscape_leakage_types_v1.1.pdf	464.6 kB	md5:99eb4d5f6431cfd072e55ebc30606fcd

Additional details

Is identical to: Preprint: arXiv:2604.04199 (arXiv)
Is supplement to: Preprint: arXiv:2603.10742 (arXiv); Preprint: 10.5281/zenodo.20450649 (DOI)
Is supplemented by: Software: https://github.com/epagogy/ml (URL)

