QSAR Leakage-Free Benchmark — EGFR pIC50 (ChEMBL v33)

Merlini, Igor

doi:10.5281/zenodo.19953763

Published May 1, 2026 | Version v1.0.0

Dataset Open

Merlini, Igor (Researcher)¹

1. ActarusLab

Reproducible benchmark dataset and code for the manuscript:

"Scaffold-Aware Evaluation Reveals Substantial Performance Inflation in EGFR pIC50 Benchmarks: A Reproducible Analysis on ChEMBL v33"

Key finding: on 10,113 EGFR inhibitors from ChEMBL v33, standard K-Fold cross-validation overestimates R² by 93% in relative terms (R² = 0.679 vs 0.352 under scaffold-honest evaluation). Although its absolute R² is low, the scaffold-honest ensemble achieves a Concordance Index of 0.728 and an Enrichment Factor@5% of 7.78×, which supports its practical utility for virtual screening on novel scaffolds.
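The three quantities quoted above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: the function names, the activity threshold used to define "active" compounds, and the tie handling are all hypothetical choices, since the record does not specify them.

```python
from itertools import combinations


def relative_inflation(r2_optimistic, r2_honest):
    """Relative overestimate of R², expressed as a fraction of the honest value."""
    return (r2_optimistic - r2_honest) / r2_honest


def enrichment_factor(y_true, y_pred, active_threshold, top_fraction=0.05):
    """Hit rate in the top-ranked fraction divided by the overall hit rate.

    `active_threshold` (a pIC50 cutoff) is an assumption of this sketch.
    """
    n = len(y_true)
    n_top = max(1, int(round(n * top_fraction)))
    # Rank compounds by predicted value, highest first.
    order = sorted(range(n), key=lambda i: y_pred[i], reverse=True)
    actives = [y >= active_threshold for y in y_true]
    total_actives = sum(actives)
    if total_actives == 0:
        raise ValueError("no actives at this threshold")
    hits_top = sum(actives[i] for i in order[:n_top])
    return (hits_top / n_top) / (total_actives / n)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the measured order."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(y_true)), 2):
        if y_true[i] == y_true[j]:
            continue  # equal measured values carry no ordering information
        comparable += 1
        d_true = y_true[i] - y_true[j]
        d_pred = y_pred[i] - y_pred[j]
        if d_pred == 0:
            concordant += 0.5  # tied predictions count as half-correct
        elif (d_true > 0) == (d_pred > 0):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# The headline figure: (0.679 - 0.352) / 0.352, reported as +93%.
print(f"{relative_inflation(0.679, 0.352):.1%}")  # prints 92.9%
```

The pairwise loop in `concordance_index` is quadratic in the number of compounds, which is acceptable for a test set of a few thousand molecules but would need a faster formulation for larger sets.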

Contents:
- paper/ — full manuscript PDF (preprint v2)
- data/ — 10,113 EGFR compounds with pre-defined scaffold splits, scaffold summary, machine-readable benchmark results JSON
- figures/ — 6 publication-ready figures (300 dpi)
- code/ — reproducible Jupyter notebook (ChEMBL download to final results in one execution)
- README.md — complete documentation

The Leakage Ladder framework defines four progressively stricter evaluation levels:
L1 — Random K-Fold (standard practice, R² = 0.679)
L2 — Multi-Seed OOF (claimed honest, R² = 0.703)
L3 — Scaffold Split single model (R² = 0.315)
L4 — Final ensemble on scaffold test (R² = 0.352, CI = 0.728)
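The gap between the random K-Fold level and the scaffold levels comes from how compounds are assigned to folds: a scaffold split keeps every member of a scaffold family on the same side of the train/test boundary. The sketch below illustrates that idea with a greedy assignment of whole scaffold groups. It is an assumption-laden example, not the procedure behind the pre-defined splits in data/: the function name, the largest-family-first heuristic, and the toy scaffold labels are hypothetical, and scaffold identifiers (for example Bemis-Murcko scaffold SMILES) are assumed to have been computed beforehand.

```python
from collections import defaultdict


def scaffold_group_split(scaffolds, test_fraction=0.2):
    """Split compound indices so that no scaffold appears in both train and test.

    `scaffolds` holds one scaffold identifier per compound.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold families first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_capacity = n - int(round(test_fraction * n))
    train, test = [], []
    for members in ordered:
        # A whole family goes to train while it still fits, otherwise to test.
        if len(train) + len(members) <= train_capacity:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Toy example with made-up scaffold labels.
scaffolds = ["quinazoline"] * 4 + ["pyrimidine"] * 3 + ["indole"] * 2 + ["purine"]
train_idx, test_idx = scaffold_group_split(scaffolds, test_fraction=0.3)

train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}
assert train_scaffolds.isdisjoint(test_scaffolds)
```

Under a random K-Fold split, close analogues of a test compound usually sit in the training folds, so the model is partly scored on interpolation within a known chemical series. Grouping by scaffold removes that overlap, which is consistent with the lower R² reported for the scaffold levels.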

All results are reproducible from the included notebook, which runs end to end with one click on Kaggle using a GPU.

Companion preprint on ChemRxiv: DOI 10.26434/chemrxiv.15001489
Author's research page: actaruslab.org

Files (1.5 MB)

Name	Size	MD5
Merlini_2026_QSAR_Leakage_Ladder_Zenodo.zip	1.5 MB	b6cb0c66c0aa9d2a7dce96ef8fa9c2e0

Additional details

Is supplement to: Preprint: 10.26434/chemrxiv.15001489 (DOI); Computational notebook: https://www.kaggle.com/code/igormerlinicomposer/qsar-leakage-free-benchmark-egfr-pic50 (URL)

