Integrating Machine Learning-Based Pose Sampling with Established Scoring Functions for Virtual Screening
Description
This dataset provides the input data for the analysis experiments presented in "Integrating Machine Learning-Based Pose Sampling with Established Scoring Functions for Virtual Screening" [1]. The associated code is available at https://github.com/lan-codes/Benchmark_VS
The dataset includes the following folders:
- dudez: input molecules as SMILES strings provided in DUDE-Z [2] for each target.
- docking_poses: docking poses generated by DiffDock-L [3] and AutoDock Vina [4,5] for 43 DUDE-Z targets.
- plif: the reference ligands collected with SIENA [6] and the protein-ligand interaction fingerprints generated with ProLIF [7] for docked compounds and reference ligands for each target.
- posebusters: the pose validity check results generated with PoseBusters [8] for all docking poses for each target.
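After unpacking or downloading the archive, it can be useful to confirm that the four folders listed above are present. The following is a minimal sketch in Python, not part of the published code. It assumes only the four folder names given above; whether they sit at the top level of the archive or beneath a root directory is not stated here, so the check accepts either layout. The function names are hypothetical.

```python
import zipfile
from pathlib import PurePosixPath

# Folder names taken from the dataset description above.
EXPECTED_FOLDERS = {"dudez", "docking_poses", "plif", "posebusters"}


def folders_in_archive(member_names):
    """Return every directory name that appears anywhere in the member paths."""
    found = set()
    for name in member_names:
        parts = PurePosixPath(name).parts
        # A trailing slash marks a directory entry; otherwise the last part is a file.
        dir_parts = parts if name.endswith("/") else parts[:-1]
        found.update(dir_parts)
    return found


def missing_folders(member_names, expected=EXPECTED_FOLDERS):
    """Return the expected folder names that do not occur in the archive listing."""
    return set(expected) - folders_in_archive(member_names)


def check_archive(zip_path):
    """Open a zip file and report which expected folders are missing (assumed layout)."""
    with zipfile.ZipFile(zip_path) as archive:
        return missing_folders(archive.namelist())
```

Calling `check_archive("data.zip")` returns an empty set if all four folders are found. Reading the listing does not extract the archive, so the check is quick even for a large file.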
Paper Abstract:
Physics-based docking methods have long been the cornerstone of structure-based virtual screening (VS). However, the emergence of machine learning (ML)-based docking approaches has opened up new possibilities for enhancing VS technologies. In this study, we explore the integration of DiffDock-L, a leading ML-based pose sampling method, into VS workflows by combining it with the well-established Vina and Gnina scoring functions. We assess this integrated approach in terms of its VS effectiveness, pose sampling quality, and complementarity to traditional physics-based docking methods, such as AutoDock Vina. Our findings from the DUDE-Z benchmark dataset show that DiffDock-L performs competitively in both VS performance and pose sampling in cross-docking settings. In most cases, it generates physically plausible and biologically relevant poses, establishing itself as a viable alternative to physics-based docking algorithms. Additionally, we found that the choice of scoring function significantly influences VS success.
References
[1] Vu, T. N. L.; Fooladi, H.; Kirchmair, J. Integrating Machine Learning-Based Pose Sampling with Established Scoring Functions for Virtual Screening. ChemRxiv 2025. DOI: 10.26434/chemrxiv-2025-96kzg-v2.
[2] Stein, R. M.; Yang, Y.; Balius, T. E.; O’Meara, M. J.; Lyu, J.; Young, J.; Tang, K.; Shoichet, B. K.; Irwin, J. J. Property-Unmatched Decoys in Docking Benchmarks. J. Chem. Inf. Model. 2021, 61 (2), 699–714. DOI: 10.1021/acs.jcim.0c00598.
[3] Corso, G.; Deng, A.; Fry, B.; Polizzi, N.; Barzilay, R.; Jaakkola, T. Deep Confident Steps to New Pockets: Strategies for Docking Generalization. arXiv February 28, 2024. DOI: 10.48550/arXiv.2402.18396.
[4] Trott, O.; Olson, A. J. AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem. 2010, 31 (2), 455–461. DOI: 10.1002/jcc.21334.
[5] Eberhardt, J.; Santos-Martins, D.; Tillack, A. F.; Forli, S. AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. J. Chem. Inf. Model. 2021, 61 (8), 3891–3898. DOI: 10.1021/acs.jcim.1c00203.
[6] Bietz, S.; Rarey, M. SIENA: Efficient Compilation of Selective Protein Binding Site Ensembles. J. Chem. Inf. Model. 2016, 56 (1), 248–259. DOI: 10.1021/acs.jcim.5b00588.
[7] Bouysset, C.; Fiorucci, S. ProLIF: A Library to Encode Molecular Interactions as Fingerprints. J. Cheminformatics 2021, 13 (1), 72. DOI: 10.1186/s13321-021-00548-6.
[8] Buttenschoen, M.; Morris, G. M.; Deane, C. M. PoseBusters: AI-Based Docking Methods Fail to Generate Physically Valid Poses or Generalise to Novel Sequences. Chem. Sci. 2024, 15 (9), 3130–3139. DOI: 10.1039/D3SC04185A.
Files
- data.zip (8.6 GB), md5:baa1d2f4659c03a43eeb9bf82cc9681a
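Because the archive is large, it is worth comparing the downloaded file against the MD5 checksum listed above before unpacking it. The sketch below uses only the Python standard library and reads the file in chunks so that the 8.6 GB archive is never held in memory at once. The function names and the chunk size are illustrative choices, not part of the dataset or its code.

```python
import hashlib

# Checksum as listed for data.zip on this record.
EXPECTED_MD5 = "baa1d2f4659c03a43eeb9bf82cc9681a"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved as `data.zip` in the working directory, `matches_expected("data.zip")` returns `True` when the download is intact.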