Published May 7, 2025 | Version 2025-Q1
Dataset Open

SDDF Energy Dataset

Description

This conformational energy dataset, developed as part of the Smart Distributed Data Factory (SDDF) project, contains over 2.75 million molecular conformations of drug-like molecules sourced from the ENAMINE database. Energies were calculated with density functional theory (DFT) using the ωB97X functional and the 6-31G(d) basis set. The conformations were generated from SMILES by three methods: RDKit embedding, RDKit embedding followed by MMFF94 optimization, and molecular dynamics (MD) simulations. Together they provide a diverse set of molecular structures and energy states. Conformation counts by generation method:

  • RDKit Conformations: 1,123,693
  • RDKit + MMFF94 Optimized: 1,151,936
  • MD-Generated: 483,279

This dataset serves as a benchmark for energy prediction models. It provides training (638,617 examples), validation (134,732 examples), and test (24,890 examples) subsets, created with a strict scaffold-based split that ensures no overlap and less than 70% similarity between the training and test sets.
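As a quick sanity check on the split, one could confirm that no molecule ID appears in more than one subset. The sketch below is not the authors' procedure and does not test scaffold overlap or similarity. It assumes that each split file is tab-separated with the molecule ID in the first column and no header row; both are assumptions that should be checked against the actual files. The function names are hypothetical.

```python
import csv
from itertools import combinations


def read_molecule_ids(path, id_column=0, has_header=False):
    """Collect the distinct values found in one column of a TSV file.

    Assumption: the molecule ID sits in `id_column` (default: first column).
    """
    ids = set()
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the header row if there is one
        for row in reader:
            # Ignore blank lines and rows too short to hold the ID column.
            if len(row) > id_column and row[id_column].strip():
                ids.add(row[id_column].strip())
    return ids


def shared_ids(splits):
    """Return {(name_a, name_b): shared IDs} for each pair of splits that overlaps."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        common = ids_a & ids_b
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps
```

Calling `shared_ids` on a dictionary built from the three split files should return an empty dictionary if the subsets are disjoint at the molecule level.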

Dataset contents:

  • data.tar.gz: contains the conformations in structure-data file (SDF) format, grouped into separate folders by molecule ID. Each conformation's label is stored in its SDF file as a property named "energy".
  • INDEX.smi: specifies the molecule IDs and their corresponding SMILES.
  • SOURCES.csv: specifies the conformation generation method for each conformation.
  • SDDF_train.tsv, SDDF_validation.tsv, and SDDF_test.tsv: specify the molecule IDs and conformations for each subset of the benchmark.
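Because each label is stored as an SDF property named "energy", it can be read without a chemistry toolkit. The sketch below is a minimal standard-library reader, not the authors' loader. It assumes the standard SDF layout, where a data header line such as `> <energy>` is followed by the value on the next line and each record ends with `$$$$`. The function name is hypothetical, and a toolkit such as RDKit would be the more usual choice when the 3D coordinates are needed as well.

```python
def read_sdf_energies(path, tag="energy"):
    """Return one value per SDF record: the float stored under `tag`, or None.

    Assumes the standard SDF data-item layout: a header line beginning
    with '>' that contains '<tag>', then the value on the following line.
    """
    energies = []
    current = None        # energy of the record being read, if found
    expect_value = False  # True right after the matching header line
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if raw.startswith(">") and f"<{tag}>" in line:
                expect_value = True
            elif expect_value:
                current = float(line)
                expect_value = False
            elif line == "$$$$":
                # End of record: store the result and reset for the next one.
                energies.append(current)
                current = None
    return energies
```

The returned list follows the order of records in the file, so it can be zipped with conformation identifiers read from the same file.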

A detailed description is provided in the accompanying paper.

Files

Files (2.2 GB)

Checksums and sizes (file names are not shown in this listing):

  • md5:2e43915f25ad4e652e320e7f2a59149a (2.0 GB)
  • md5:b95a8a7f2a2902ab168ee3c0d81e577d (36.7 MB)
  • md5:cd06458b02fc78b0608e63bd3295ea08 (260.0 kB)
  • md5:592ed3f45ca8dffd87876387f838ff63 (6.8 MB)
  • md5:2ea7ab2a2b3f588761d574cba4d2a6e5 (1.4 MB)
  • md5:2080a0a25911eba5dcc59a2363ddd0fa (74.2 MB)
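The MD5 checksums listed for the files can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names are hypothetical, and the file is read in chunks so that the 2.0 GB archive does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

To use it, pass the local path of a downloaded file together with the checksum string shown for that file; a `False` result suggests a corrupted or incomplete download.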

Additional details

Additional titles

Alternative title
SDDF-Energy-2025Q1

Related works

Is published in
Preprint: 10.1101/2024.10.22.619651 (DOI)

Software