Published May 1, 2026 | Version v1.0.0
Dataset Open

QSAR Leakage-Free Benchmark — EGFR pIC50 (ChEMBL v33)

  • 1. ActarusLab

Description

Reproducible benchmark dataset and code for the manuscript:

"Scaffold-Aware Evaluation Reveals Substantial Performance Inflation in EGFR pIC50 Benchmarks: A Reproducible Analysis on ChEMBL v33"

Key finding: on 10,113 EGFR inhibitors from ChEMBL v33, standard K-Fold cross-validation overestimates R² by 93% in relative terms (R² = 0.679 under random K-Fold vs. 0.352 under scaffold-honest evaluation). Although the absolute R² is low, the scaffold-honest ensemble achieves a Concordance Index of 0.728 and an Enrichment Factor at 5% of 7.78×, indicating practical utility for virtual screening on novel scaffolds.
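The 93% figure is the relative gap between the two R² values quoted above. The sketch below reproduces that arithmetic and gives textbook-style definitions of the two ranking metrics. It is an illustration only: the function names, the tie handling in the concordance index, and the `active_threshold` parameter are assumptions, since the exact definitions used in the manuscript are not stated in this record.

```python
from itertools import combinations


def relative_inflation(optimistic, honest):
    """Relative overestimate of `optimistic` with respect to `honest`."""
    return (optimistic - honest) / honest


def concordance_index(y_true, y_pred):
    """Share of comparable pairs whose predicted order matches the true order.

    Pairs with equal true values are skipped; tied predictions count 0.5
    (an assumed convention).
    """
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (p1 > p2) == (t1 > t2):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def enrichment_factor(y_true, y_pred, active_threshold, fraction=0.05):
    """Hit rate in the top `fraction` of predictions divided by the overall hit rate.

    `active_threshold` (the pIC50 value at or above which a compound counts
    as active) is a hypothetical parameter chosen by the caller.
    """
    n = len(y_true)
    n_top = max(1, int(round(n * fraction)))
    ranked = sorted(range(n), key=lambda i: y_pred[i], reverse=True)
    top_hits = sum(y_true[i] >= active_threshold for i in ranked[:n_top])
    total_hits = sum(t >= active_threshold for t in y_true)
    if total_hits == 0:
        raise ValueError("no actives at this threshold")
    return (top_hits / n_top) / (total_hits / n)


# R² values quoted in the description: random K-Fold vs. scaffold-honest.
print(f"{relative_inflation(0.679, 0.352):.1%}")  # prints 92.9%
```

An enrichment factor of 1 corresponds to random selection, so a value of 7.78 at 5% means the top 5% of ranked compounds contain roughly 7.78 times as many actives as a random 5% sample would.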

Contents:
- paper/ — full manuscript PDF (preprint v2)
- data/ — 10,113 EGFR compounds with pre-defined scaffold splits, scaffold summary, machine-readable benchmark results JSON
- figures/ — 6 publication-ready figures (300 dpi)
- code/ — reproducible Jupyter notebook (ChEMBL download to final results in one execution)
- README.md — complete documentation

The Leakage Ladder framework defines four progressively stricter evaluation levels:
L1 — Random K-Fold (standard practice, R² = 0.679)
L2 — Multi-Seed OOF (claimed honest, R² = 0.703)
L3 — Scaffold Split single model (R² = 0.315)
L4 — Final ensemble on scaffold test (R² = 0.352, Concordance Index = 0.728)
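The difference between the random levels (L1, L2) and the scaffold levels (L3, L4) comes down to whether compounds sharing a scaffold may appear on both sides of the split. A minimal sketch of a group-wise scaffold split follows, assuming scaffold identifiers have already been computed for each compound (for example Bemis-Murcko scaffolds). The function names and the largest-groups-to-train heuristic are assumptions for illustration and are not taken from the included notebook.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Assign whole scaffold groups to train or test.

    `scaffolds` holds one scaffold identifier per compound (assumed to be
    precomputed). Large groups are placed in train first, so the test set
    is built from rarer scaffolds that the model has never seen.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train_target = int(round(len(scaffolds) * (1 - test_fraction)))

    train, test = [], []
    for members in ordered:
        # A group is never divided: it goes entirely to one side.
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


def shared_scaffolds(scaffolds, train, test):
    """Scaffolds present on both sides of a split (empty set means no leakage)."""
    return {scaffolds[i] for i in train} & {scaffolds[i] for i in test}


# Toy example with hypothetical scaffold labels.
toy = ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"]
train_idx, test_idx = scaffold_split(toy, test_fraction=0.2)
leak = shared_scaffolds(toy, train_idx, test_idx)
```

Running `shared_scaffolds` on a random K-Fold partition of the same data is a quick way to see how many scaffolds straddle the train and test sides, which is the leakage the ladder is designed to expose.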

All results are reproducible from the included notebook (one-click execution on Kaggle with a GPU).

Companion preprint on ChemRxiv: DOI 10.26434/chemrxiv.15001489
Author's research page: actaruslab.org

Files

Merlini_2026_QSAR_Leakage_Ladder_Zenodo.zip (1.5 MB)
md5:b6cb0c66c0aa9d2a7dce96ef8fa9c2e0
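The md5 checksum listed for the archive can be used to confirm that a download is intact. A minimal sketch using Python's standard `hashlib` module is shown here; the helper names and the local path passed to them are assumptions, while the expected digest is the one listed in this record.

```python
import hashlib

# Digest published alongside the archive in this record.
EXPECTED_MD5 = "b6cb0c66c0aa9d2a7dce96ef8fa9c2e0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """True when the file at `path` matches the expected md5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_archive("Merlini_2026_QSAR_Leakage_Ladder_Zenodo.zip")` should return `True` for an uncorrupted copy saved under that name in the working directory.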

Additional details