There is a newer version of the record available.

Published May 8, 2023 | Version 1
Dataset Open

Large-scale Docking Datasets for Machine Learning

  • 1. Science for Life Laboratory, Uppsala University
  • 2. Uppsala University, Stockholm University, Örebro University

Description

Large-scale virtual screening has become a valuable tool for early-phase drug discovery. Recent expansions of commercial chemical space have made it computationally intractable to evaluate every compound in these libraries. Machine learning is one of the methods that aim to prioritize specific subsets of these vast libraries. Testing such methods requires access to large-scale datasets. To help the community benchmark their work, we share the docking scores of several ultralarge virtual screening campaigns.

The datasets we provide contain canonical SMILES, compound identifiers, and docking scores. We docked two different chemical libraries against eight therapeutically relevant biological targets. The first dataset contains approximately 15.5 million molecules adhering to the "Rule-of-Four", whereas the second dataset consists of approximately 235 million "lead-like" molecules. The biological targets represent different classes of proteins and binding sites.
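As a rough illustration of working with records of this kind, the sketch below streams compounds one at a time and keeps only the best-scoring few, so that a file with hundreds of millions of rows never has to fit in memory. The CSV layout, the column names (`smiles`, `id`, `score`), the sample compounds, and the convention that lower scores are more favourable are all assumptions made for this example. Check the actual file format in the repository before adapting it.

```python
import csv
import heapq
import io


def read_records(handle):
    """Yield one dict per compound from a CSV stream with a header row.

    Assumption: columns are named 'smiles', 'id' and 'score'.
    """
    for row in csv.DictReader(handle):
        yield {
            "smiles": row["smiles"],
            "id": row["id"],
            "score": float(row["score"]),
        }


def top_scoring(records, n=3):
    """Return the n records with the lowest docking score.

    Assumption: more negative scores mean better predicted binding.
    heapq.nsmallest consumes the iterator lazily and keeps only n items.
    """
    return heapq.nsmallest(n, records, key=lambda r: r["score"])


# Tiny made-up sample standing in for a real file (not actual data).
SAMPLE = """smiles,id,score
CCO,cmpd-1,-12.5
c1ccccc1,cmpd-2,-30.1
CC(=O)O,cmpd-3,-8.0
CCN,cmpd-4,-22.7
"""

best = top_scoring(read_records(io.StringIO(SAMPLE)), n=2)
for rec in best:
    print(rec["id"], rec["smiles"], rec["score"])
```

For a real file, the `io.StringIO` stand-in would be replaced by an open file handle. The streaming design matters mainly for the larger of the two datasets.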

More details on the datasets and our methods can be found in our GitHub repository (https://github.com/carlssonlab/conformalpredictor) and in our pre-print (https://doi.org/10.26434/chemrxiv-2023-w3x36).

Please feel free to download and use these datasets for your own research purposes. We only ask that you cite our pre-print and datasets appropriately if you use them in your work. Thank you for your interest in our research!

Files

Total size: 35.8 GB

Size      MD5 checksum
2.5 GB    dd365f64c397259d24111e38634e91fc
33.3 GB   8476a5cf08d5b7219584e24385e8c85f
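Because the files are large, it may be worth confirming that a download completed intact by comparing its MD5 digest with the checksum listed above. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local file path is left to the reader because the file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte files are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Return True if the file's MD5 equals the expected value.

    Accepts the checksum with or without a leading 'md5:' prefix.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A call such as `matches_checksum(local_path, "md5:dd365f64c397259d24111e38634e91fc")` would then return `True` only if the downloaded copy of the 2.5 GB file is byte-for-byte identical to the published one.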