Published January 1, 2024 | Version 1.0.1
Dataset Open

CSS and Benchmark Datasets of GeminiMol

  • 1. ShanghaiTech University

Description

A molecular representation model is a neural network that converts molecular representations (SMILES strings, graphs) into feature vectors and has the potential to be applied across a wide range of drug discovery scenarios. However, current molecular representation models have been limited to 2D or static 3D structures, overlooking the dynamic nature of small molecules in solution and their ability to undergo the conformational changes that are crucial for drug-target interactions.

To address this limitation, we propose a novel strategy that incorporates the conformational space profile into molecular representation learning. By capturing the intricate interplay between molecular structure and conformational space, our strategy enhances the representational capacity of our model named GeminiMol. Consequently, when pre-trained on a miniaturized molecular dataset, the GeminiMol model demonstrates a balanced and superior performance not only on traditional molecular property prediction tasks but also on zero-shot learning tasks, including virtual screening and target identification. By capturing the dynamic behavior of small molecules, our strategy paves the way for rapid exploration of chemical space, facilitating the transformation of drug design paradigms. 

In this study, a diverse collection of 39,290 molecules was used for conformational searching and shape alignment to generate a comprehensive dataset of molecular conformational space similarity. To assess the model's performance, benchmark datasets comprising millions of molecules were used for downstream tasks. Here, we provide all the training and benchmarking data used in this study to facilitate reproducibility of the work.

Files

Benchmark_DUD-E.zip

Files (2.0 GB)

MD5 checksum                            Size
md5:a2d53344e0f92e2006b2700671b088bf    27.0 MB
md5:328c960d8ea48ab30eb7f79172736e59    34.8 MB
md5:6dcc1ebb049b29cbb27b8e1616b2bc01    181.1 MB
md5:4bfc4aa5bd14bb9c4cab8ecded26b4e7    1.2 GB
md5:ffd70739300e896af23ccc5899fea7b2    50.1 MB
md5:6a6ae63ed2b8b8ba712de82b2826cab1    12.1 MB
md5:80dbfdf2c08a770cd00c6a22f1d5176e    441.5 MB
md5:f9396b8759ef91ec1bd3fe2945bbceca    26.6 MB
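
Because each file is listed with an MD5 checksum, a download can be checked for corruption before use. The following is a minimal sketch in Python using only the standard library. The helper names md5_of_file and matches_checksum are illustrative and not part of the dataset or of any GeminiMol tooling. This listing does not pair file names with checksums, so the expected value for a given archive has to be taken from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for the larger archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 against an expected value.

    Accepts the value with or without the "md5:" prefix used in the
    file listing, and ignores letter case.
    """
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

For example, matches_checksum("path/to/downloaded.zip", "md5:<value from the listing>") returns True only when the local file is byte-identical to the published one. The path here is a placeholder. MD5 is adequate for detecting transfer errors, although it is not a safeguard against deliberate tampering.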

Additional details

Additional titles

Alternative title
Binding Identification Benchmark Dataset
Alternative title
Virtual Screening Benchmark Dataset
Alternative title
QSAR Benchmark Dataset
Alternative title
ADMET Benchmark Dataset
Alternative title
NCI/DTP QSAR datasets