There is a newer version of the record available.

Published June 10, 2026 | Version nc_1000_v1.0
Dataset Open

modelforge curated dataset: GEOM QM9

  • 1. Memorial Sloan Kettering Cancer Center

Description

Modelforge Curated GEOM QM9 Dataset:
- 1000-conformer test dataset
- Version: nc_1000_v1.0

This provides a curated HDF5 file for the QM9 subset of the Geometric Ensemble Of Molecules (GEOM) dataset (https://doi.org/10.1038/s41597-022-01288-4). The GEOM QM9 dataset samples the 133,885 organic molecules with up to nine heavy atoms (C, O, N, or F; excluding H) from the original QM9 dataset (https://doi.org/10.1038/sdata.2014.22), generating multiple configurations for each molecule using the CREST software, which relies on GFN2-xTB. Energies were evaluated using DFT via ORCA 5.0.2 with the r2SCAN-3c functional and mTZVPP basis.

 

The provided HDF5 file contains a subset of this dataset intended for testing. It is designed to be compatible with modelforge, an infrastructure for implementing and training neural network potentials (NNPs). This test dataset contains 1000 total configurations for 67 different systems.
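The internal layout of the file (group names, dataset names, attribute keys) is not spelled out in this description, so a cautious first step is to list what the file actually contains. The following is a minimal sketch that assumes only that the download is a valid HDF5 file and that the third-party h5py package is installed; the function name summarize_hdf5 is hypothetical and is not part of modelforge.

```python
import h5py


def summarize_hdf5(path):
    """Collect (name, shape, dtype, attributes) for every dataset in an HDF5 file.

    Nothing about the file layout is assumed: the whole hierarchy is walked
    and whatever datasets and attributes exist are reported.
    """
    rows = []

    def visit(name, obj):
        # Groups are skipped; only datasets carry array data.
        if isinstance(obj, h5py.Dataset):
            attrs = {key: obj.attrs[key] for key in obj.attrs}
            rows.append((name, obj.shape, str(obj.dtype), attrs))

    with h5py.File(path, "r") as handle:
        handle.visititems(visit)
    return rows


# Usage (the path is a placeholder for wherever the file was saved):
# for name, shape, dtype, attrs in summarize_hdf5("path/to/downloaded_file.hdf5"):
#     print(name, shape, dtype, attrs)
```

If units are stored as HDF5 attributes, they should show up in the attribute dictionaries printed here; if they are stored some other way, the listing of names should at least indicate where to look.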

 

When applicable, the units of properties are provided in the data file, encoded as strings compatible with the openff-units package. For more information about the structure of the data file, please see the following:

Properties Included:

  • atomic_numbers
  • positions
    • "per_atom"
    • "nanometer"
  • dft_total_energy
    • "per_system"
    • "kilojoule_per_mole"
  • total_charge
    • "per_system"
    • "elementary_charge"
  • smiles
    • "meta_data"
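For readers who want the list above in machine-readable form, the sketch below transcribes it into a small table and adds two plain unit conversions. The class and function names (PropertySpec, nanometer_to_angstrom, kj_per_mol_to_kcal_per_mol) are illustrative assumptions, not modelforge or openff-units APIs. Entries that the list does not specify, such as the classification and unit of atomic_numbers, are left as None rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PropertySpec:
    name: str
    classification: Optional[str]  # as listed above, or None if not listed
    unit: Optional[str]            # unit string as listed above, or None


# Transcribed directly from the property list; nothing is inferred.
PROPERTIES = [
    PropertySpec("atomic_numbers", None, None),
    PropertySpec("positions", "per_atom", "nanometer"),
    PropertySpec("dft_total_energy", "per_system", "kilojoule_per_mole"),
    PropertySpec("total_charge", "per_system", "elementary_charge"),
    PropertySpec("smiles", "meta_data", None),
]

ANGSTROM_PER_NANOMETER = 10.0
KILOJOULE_PER_KILOCALORIE = 4.184  # thermochemical calorie, exact by definition


def nanometer_to_angstrom(value):
    """Convert a length from nanometers to angstroms."""
    return value * ANGSTROM_PER_NANOMETER


def kj_per_mol_to_kcal_per_mol(value):
    """Convert an energy from kJ/mol to kcal/mol."""
    return value / KILOJOULE_PER_KILOCALORIE


UNIT_BY_NAME = {spec.name: spec.unit for spec in PROPERTIES}
```

For example, a position coordinate of 0.15 in the file's units corresponds to about 1.5 angstroms, and UNIT_BY_NAME["dft_total_energy"] gives the unit string to hand to a unit-aware library.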
             

Files

Files (508.3 kB)

md5:1627bf69a7029b2822733152f8a3e445 (508.3 kB)
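The md5 checksum listed for the file can be used to confirm that a download is intact. A minimal sketch using only the Python standard library follows; the helper names file_md5 and matches_expected are hypothetical, and the file path in the usage comment is a placeholder because the file name is not given in this record.

```python
import hashlib

# Checksum as listed in this record.
EXPECTED_MD5 = "1627bf69a7029b2822733152f8a3e445"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare a file's md5 digest with the expected value, ignoring case."""
    return file_md5(path) == expected.lower()


# Usage (placeholder path):
# print(matches_expected("path/to/downloaded_file.hdf5"))
```

A mismatch most often means the download was truncated or that a different version of the record was fetched.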

Additional details

Related works

Is derived from
Publication: 10.1038/s41597-022-01288-4 (DOI)
Dataset: 10.7910/DVN/JNGTDF (DOI)

Software

Repository URL
https://github.com/choderalab/modelforge
Programming language
Python
Development Status
Active