Published February 9, 2023 | Version 0.1
Dataset Open

Artificial Dataset of Molecular Enthalpies of Formation

Authors/Creators

  • 1. MIT, VCU

Description

This repository contains an artificial dataset constructed for studying uncertainty characterization and quantification in chemical ML applications. The data are designed to be noise-free: the targets are group-additivity calculations of enthalpy of formation, not directly calculated or measured enthalpies of formation.

Where did the targets come from:
These data files contain SMILES and targets from a simple group additivity calculation of enthalpy of formation at 298 K. The group additivity coefficients were fitted to the molecules of the QM9 computational chemistry database. Only fragments that appeared in at least 100 molecules were considered. The coefficients were rounded to 3 decimal places. Each group covers only a bond radius of 1 from its central atom.
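As a rough illustration of how a radius-1 group additivity target can be computed, the sketch below sums one coefficient per heavy atom, keyed by the atom's element and its directly bonded neighbors. The molecule representation (explicit atom and bond lists), the group key format, the choice to ignore bond orders, and the coefficient values are all assumptions made for illustration. They are not the fitted coefficients or the group definition used for this dataset.

```python
# Hypothetical coefficients for illustration only; NOT the values fitted to QM9.
# Key: (central element, sorted tuple of neighbor elements).
HYPOTHETICAL_COEFFS = {
    ("C", ("C", "H", "H", "H")): -10.0,
    ("C", ("C", "H", "H", "O")): -8.0,
    ("O", ("C", "H")): -38.0,
}


def radius1_groups(atoms, bonds):
    """Return one group key per heavy atom, built from its bonded neighbors.

    Assumption: bond orders are ignored and hydrogens are never central atoms.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(atoms[j])
        neighbors[j].append(atoms[i])
    return [
        (atoms[i], tuple(sorted(neighbors[i])))
        for i in range(len(atoms))
        if atoms[i] != "H"
    ]


def group_additivity_h298(atoms, bonds, coeffs):
    """Sum group coefficients; return None if any group has no coefficient."""
    total = 0.0
    for group in radius1_groups(atoms, bonds):
        if group not in coeffs:
            return None  # such a molecule would be excluded from the dataset
        total += coeffs[group]
    return round(total, 3)


# Ethanol with explicit hydrogens: C0-C1-O2, H3..H8.
ETHANOL_ATOMS = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
ETHANOL_BONDS = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]

# Methylamine has a nitrogen-containing group absent from the toy table.
METHYLAMINE_ATOMS = ["C", "N", "H", "H", "H", "H", "H"]
METHYLAMINE_BONDS = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 6)]

if __name__ == "__main__":
    print(group_additivity_h298(ETHANOL_ATOMS, ETHANOL_BONDS, HYPOTHETICAL_COEFFS))
    print(group_additivity_h298(METHYLAMINE_ATOMS, METHYLAMINE_BONDS, HYPOTHETICAL_COEFFS))
```

Because the target is a deterministic sum of rounded coefficients, two molecules with the same multiset of groups receive exactly the same value, which is what makes the dataset noise-free.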

Where did the SMILES come from:
The group additivity coefficients were applied to the GDB-11 computational chemistry dataset. GDB-11 contains 26.4M molecular SMILES and attempts to cover all possible organic molecules of up to 11 heavy atoms composed of C, H, O, N, and F. Molecules containing groups not represented in the group additivity coefficients were excluded, leaving 7,906,815 SMILES. Although these SMILES contain chiral centers, their stereochemistry is not specified. No duplicate SMILES are present.
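The exclusion and de-duplication steps described above could look roughly like the following. This is a minimal sketch, not the authors' pipeline: the group labels are hypothetical strings, the groups for each molecule are assumed to be precomputed, and duplicates are detected by exact string match, which is only meaningful if the SMILES are already in a canonical form.

```python
def filter_covered(candidates, known_groups):
    """Keep SMILES whose groups all have coefficients, dropping repeated strings.

    candidates: iterable of (smiles, groups) pairs, groups precomputed (assumption).
    known_groups: set of group labels that have a fitted coefficient.
    """
    seen = set()
    kept = []
    for smiles, groups in candidates:
        if smiles in seen:
            continue  # exact-string duplicate; assumes canonical SMILES
        seen.add(smiles)
        if set(groups) <= known_groups:
            kept.append(smiles)
    return kept


# Hypothetical group labels, for illustration only.
KNOWN_GROUPS = {"C-(C)(H)3", "C-(C)(H)2(O)", "O-(C)(H)"}

CANDIDATES = [
    ("CC", ["C-(C)(H)3", "C-(C)(H)3"]),
    ("CCO", ["C-(C)(H)3", "C-(C)(H)2(O)", "O-(C)(H)"]),
    ("CC", ["C-(C)(H)3", "C-(C)(H)3"]),  # repeat, dropped
    ("CF", ["C-(F)(H)3"]),               # uncovered group, excluded
]

if __name__ == "__main__":
    print(filter_covered(CANDIDATES, KNOWN_GROUPS))
```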

Scripts used for generating data subsets and added-noise datasets are also included.
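The included scripts are not reproduced here. As a hedged sketch of what subset generation and noise addition can look like, the example below draws a seeded random subset and adds zero-mean Gaussian noise to the targets. The function names, the Gaussian noise model, and the toy rows are assumptions for illustration and may differ from the scripts in the archive.

```python
import random


def random_subset(rows, size, seed=0):
    """Return a reproducible random subset of (smiles, target) rows."""
    rng = random.Random(seed)
    return rng.sample(rows, size)


def add_gaussian_noise(rows, sigma, seed=0):
    """Return rows with zero-mean Gaussian noise of width sigma added to each target."""
    rng = random.Random(seed)
    return [(smiles, target + rng.gauss(0.0, sigma)) for smiles, target in rows]


# Toy rows with made-up target values; not taken from the dataset.
TOY_ROWS = [("CC", -20.0), ("CCO", -56.0), ("CCC", -25.0), ("CCCO", -61.0)]

if __name__ == "__main__":
    subset = random_subset(TOY_ROWS, 2, seed=42)
    noisy = add_gaussian_noise(subset, sigma=1.0, seed=42)
    for (smiles, clean), (_, perturbed) in zip(subset, noisy):
        print(smiles, clean, perturbed)
```

Fixing the seed keeps the subsets and the noise realizations reproducible, so that experiments on the noise-free and noisy targets can be compared on identical molecules.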

Files

groupadditivity_h298.zip

Files (146.5 MB)

Name: groupadditivity_h298.zip
Size: 146.5 MB
md5:06e3dd5a2c9486ee78519c15b49c6ace
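After downloading, the archive can be checked against the md5 checksum listed above. The sketch below uses only Python's standard `hashlib` module. The local path is an assumption: it presumes the file was saved under its original name in the working directory.

```python
import hashlib

# Checksum as listed on this record.
EXPECTED_MD5 = "06e3dd5a2c9486ee78519c15b49c6ace"


def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest matches the expected checksum."""
    return file_md5(path) == expected


if __name__ == "__main__":
    # Assumed local file name; adjust to wherever the archive was saved.
    print(archive_is_intact("groupadditivity_h298.zip"))
```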

Additional details

Related works

Is supplement to
Preprint: 10.26434/chemrxiv-2023-00vcg (DOI)