QSAR Leakage-Free Benchmark — EGFR pIC50 (ChEMBL v33)
Description
Reproducible benchmark dataset and code for the manuscript:
"Scaffold-Aware Evaluation Reveals Substantial Performance Inflation in EGFR pIC50 Benchmarks: A Reproducible Analysis on ChEMBL v33"
Key finding: standard K-Fold cross-validation overestimates R² by 93% in relative terms on 10,113 EGFR inhibitors from ChEMBL v33 (R² = 0.679 under random K-Fold vs. 0.352 under scaffold-honest evaluation). Although the absolute R² is low, the scaffold-honest ensemble achieves a Concordance Index of 0.728 and an Enrichment Factor@5% of 7.78×, indicating practical utility for virtual screening on novel scaffolds.
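The three quantities quoted above can be illustrated with a minimal, standard-library sketch. It is not taken from the included notebook. The function names, the pairwise definition of the concordance index, and the `active_threshold` used to label actives for the enrichment factor are all assumptions made for illustration; the manuscript may define these metrics differently.

```python
from itertools import combinations


def relative_inflation(r2_optimistic, r2_honest):
    """Relative overestimate of R², expressed as a fraction of the honest value."""
    return (r2_optimistic - r2_honest) / r2_honest


def concordance_index(y_true, y_pred):
    """Share of comparable pairs ranked in the same order (assumed pairwise definition).

    Pairs with equal true values are skipped; tied predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (t1 < t2) == (p1 < p2):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def enrichment_factor(y_true, y_pred, active_threshold, top_fraction=0.05):
    """Hit rate in the top-ranked fraction divided by the overall hit rate.

    `active_threshold` is a hypothetical pIC50 cutoff for calling a compound active.
    """
    n = len(y_true)
    n_top = max(1, int(round(n * top_fraction)))
    ranked = sorted(range(n), key=lambda i: y_pred[i], reverse=True)
    actives = [y >= active_threshold for y in y_true]
    total_actives = sum(actives)
    if total_actives == 0:
        raise ValueError("no actives at this threshold")
    hits_top = sum(actives[i] for i in ranked[:n_top])
    return (hits_top / n_top) / (total_actives / n)


# R² values reported in this record: 0.679 (random K-Fold) vs 0.352 (scaffold-honest).
inflation = relative_inflation(0.679, 0.352)
print(f"relative R² inflation: {inflation:.3f}")  # about 0.929, i.e. roughly 93%
```

With real predictions, `concordance_index` and `enrichment_factor` would be called on the scaffold-test pIC50 values and the model outputs; the numbers reported in this record come from the authors' own pipeline, not from this sketch.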
Contents:
- paper/ — full manuscript PDF (preprint v2)
- data/ — 10,113 EGFR compounds with pre-defined scaffold splits, scaffold summary, machine-readable benchmark results JSON
- figures/ — 6 publication-ready figures (300 dpi)
- code/ — reproducible Jupyter notebook (ChEMBL download to final results in one execution)
- README.md — complete documentation
The Leakage Ladder framework defines four progressively stricter evaluation levels:
L1 — Random K-Fold (standard practice, R² = 0.679)
L2 — Multi-Seed OOF (claimed honest, R² = 0.703)
L3 — Scaffold Split single model (R² = 0.315)
L4 — Final ensemble on scaffold test (R² = 0.352, CI = 0.728)
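The difference between the random and scaffold levels comes down to whether compounds sharing a scaffold can land on both sides of a split. The sketch below shows one way to build a scaffold-grouped split; it is an illustration, not the procedure used to create the pre-defined splits in data/. It assumes each compound already has a scaffold key (for example a Bemis-Murcko scaffold SMILES, which is an assumption about the scaffold type), and the function names and toy scaffolds are hypothetical.

```python
import random
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2, seed=0):
    """Assign whole scaffold groups to train or test so no scaffold spans both."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # seeded shuffle keeps the split reproducible

    n_test_target = int(len(scaffolds) * test_fraction)
    train_idx, test_idx = [], []
    for key in keys:
        # A group goes to test only if it fits in the remaining test budget.
        if len(test_idx) + len(groups[key]) <= n_test_target:
            test_idx.extend(groups[key])
        else:
            train_idx.extend(groups[key])
    return sorted(train_idx), sorted(test_idx)


def shared_scaffolds(scaffolds, train_idx, test_idx):
    """Scaffolds present on both sides of a split; empty means no scaffold leakage."""
    return {scaffolds[i] for i in train_idx} & {scaffolds[i] for i in test_idx}


# Toy scaffold keys (hypothetical): 4 quinazolines, 3 pyridines, 3 benzenes.
toy_scaffolds = ["c1ccc2ncncc2c1"] * 4 + ["c1ccncc1"] * 3 + ["c1ccccc1"] * 3
train_idx, test_idx = scaffold_split(toy_scaffolds, test_fraction=0.3, seed=0)
print(len(train_idx), len(test_idx), shared_scaffolds(toy_scaffolds, train_idx, test_idx))
```

Running `shared_scaffolds` on a random K-Fold split of the same data would typically return a non-empty set, which is the kind of leakage the stricter levels are meant to exclude.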
All results are reproducible from the included notebook (1-click on Kaggle with GPU).
Companion preprint on ChemRxiv: DOI 10.26434/chemrxiv.15001489
Author's research page: actaruslab.org
Files
Merlini_2026_QSAR_Leakage_Ladder_Zenodo.zip (1.5 MB)
md5:b6cb0c66c0aa9d2a7dce96ef8fa9c2e0
Additional details
Related works
- Is supplement to
  - Preprint: 10.26434/chemrxiv.15001489 (DOI)
  - Computational notebook: https://www.kaggle.com/code/igormerlinicomposer/qsar-leakage-free-benchmark-egfr-pic50 (URL)