Integrating Machine Learning-Based Pose Sampling with Established Scoring Functions for Virtual Screening

Vu, Thi Ngoc Lan; Fooladi, Hosein; Kirchmair, Johannes

doi:10.5281/zenodo.14905986

Published February 21, 2025 | Version v1

Dataset Open

1. University of Vienna
2. Christian Doppler Laboratory for Molecular Informatics in the Biosciences

This dataset provides the input data for the analyses presented in "Integrating Machine Learning-Based Pose Sampling with Established Scoring Functions for Virtual Screening" [1]. The associated code is available at https://github.com/lan-codes/Benchmark_VS.

The dataset includes the following folders:

dudez: input molecules as SMILES strings provided in DUDE-Z [2] for each target.
docking_poses: docking poses generated with DiffDock-L [3] and AutoDock Vina [4,5] for 43 DUDE-Z targets.
plif: the reference ligands collected with SIENA [6] and the protein-ligand interaction fingerprints generated with ProLIF [7] for docked compounds and reference ligands for each target.
posebusters: the pose validity check results generated with PoseBusters [8] for all docking poses for each target.
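
As a quick sanity check after extraction, the folder layout described above can be verified with a few lines of Python. This is a minimal sketch, not part of the dataset's own tooling: the helper name `missing_folders` and the extraction directory `data` are assumptions, as is the idea that the four folders sit directly under the extraction root.

```python
from pathlib import Path

# Folder names as listed in the dataset description.
EXPECTED_FOLDERS = ("dudez", "docking_poses", "plif", "posebusters")


def missing_folders(data_root):
    """Return the expected folder names that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumption: data.zip was extracted into a local directory named "data".
    missing = missing_folders("data")
    print("all folders present" if not missing else f"missing: {missing}")
```

If the archive unpacks with an extra top-level directory, point `missing_folders` at that directory instead.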

Paper Abstract:

Physics-based docking methods have long been the cornerstone of structure-based virtual screening (VS). However, the emergence of machine learning (ML)-based docking approaches has opened up new possibilities for enhancing VS technologies. In this study, we explore the integration of DiffDock-L, a leading ML-based pose sampling method, into VS workflows by combining it with the well-established Vina and Gnina scoring functions. We assess this integrated approach in terms of its VS effectiveness, pose sampling quality, and complementarity to traditional physics-based docking methods, such as AutoDock Vina. Our findings from the DUDE-Z benchmark dataset show that DiffDock-L performs competitively in both VS performance and pose sampling in cross-docking settings. In most cases, it generates physically plausible and biologically relevant poses, establishing itself as a viable alternative to physics-based docking algorithms. Additionally, we found that the choice of scoring function significantly influences VS success.

References

[1] Vu, T. N. L.; Fooladi, H.; Kirchmair, J. Integrating Machine Learning-Based Pose Sampling with Established Scoring Functions for Virtual Screening. ChemRxiv 2025. DOI: 10.26434/chemrxiv-2025-96kzg-v2.

[2] Stein, R. M.; Yang, Y.; Balius, T. E.; O’Meara, M. J.; Lyu, J.; Young, J.; Tang, K.; Shoichet, B. K.; Irwin, J. J. Property-Unmatched Decoys in Docking Benchmarks. J. Chem. Inf. Model. 2021, 61 (2), 699–714. DOI: 10.1021/acs.jcim.0c00598.

[3] Corso, G.; Deng, A.; Fry, B.; Polizzi, N.; Barzilay, R.; Jaakkola, T. Deep Confident Steps to New Pockets: Strategies for Docking Generalization. arXiv February 28, 2024. DOI: 10.48550/arXiv.2402.18396.

[4] Trott, O.; Olson, A. J. AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem. 2010, 31 (2), 455–461. DOI: 10.1002/jcc.21334.

[5] Eberhardt, J.; Santos-Martins, D.; Tillack, A. F.; Forli, S. AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. J. Chem. Inf. Model. 2021, 61 (8), 3891–3898. DOI: 10.1021/acs.jcim.1c00203.

[6] Bietz, S.; Rarey, M. SIENA: Efficient Compilation of Selective Protein Binding Site Ensembles. J. Chem. Inf. Model. 2016, 56 (1), 248–259. DOI: 10.1021/acs.jcim.5b00588.

[7] Bouysset, C.; Fiorucci, S. ProLIF: A Library to Encode Molecular Interactions as Fingerprints. J. Cheminformatics 2021, 13 (1), 72. DOI: 10.1186/s13321-021-00548-6.

[8] Buttenschoen, M.; Morris, G. M.; Deane, C. M. PoseBusters: AI-Based Docking Methods Fail to Generate Physically Valid Poses or Generalise to Novel Sequences. Chem. Sci. 2024, 15 (9), 3130–3139. DOI: 10.1039/D3SC04185A.

Files (8.6 GB)

Name	Size
data.zip md5:baa1d2f4659c03a43eeb9bf82cc9681a	8.6 GB
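
Because the archive is large, it may be worth confirming that a download is intact by comparing it against the MD5 checksum listed for data.zip. The sketch below uses only the Python standard library; the function names `md5_of_file` and `matches_expected` and the local path `data.zip` are illustrative assumptions.

```python
import hashlib
from pathlib import Path

# Checksum as listed on the dataset page for data.zip.
EXPECTED_MD5 = "baa1d2f4659c03a43eeb9bf82cc9681a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare the file's digest with the expected one, ignoring letter case."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("data.zip")  # assumption: the download sits in the working directory
    if archive.exists():
        print("checksum OK" if matches_expected(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

Reading in chunks keeps memory use flat regardless of archive size, which matters for a file of several gigabytes.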

