Published February 15, 2024 | Version v1
Dataset Open

Dataset, splits, models, and scripts for the QM descriptors prediction

  • 1. Massachusetts Institute of Technology
  • 2. National Taiwan University

Description

Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.
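As a minimal sketch of unpacking the four archives named above, the helper below extracts each one it finds into its own subfolder. The function name, the folder layout, and the choice to skip missing archives are assumptions made for illustration; the internal structure of each archive is not described on this page.

```python
import tarfile
from pathlib import Path

# Archive names as given in the description above.
ARCHIVES = ["data.tar.gz", "splits.tar.gz", "models.tar.gz", "scripts.tar.gz"]


def extract_archives(download_dir, target_dir):
    """Extract each archive found in download_dir into target_dir/<name>.

    Returns the names of the archives that were actually extracted.
    (Hypothetical helper, not part of the published scripts.)
    """
    download_dir, target_dir = Path(download_dir), Path(target_dir)
    extracted = []
    for name in ARCHIVES:
        archive = download_dir / name
        if not archive.is_file():
            continue  # skip archives that were not downloaded
        out = target_dir / name.replace(".tar.gz", "")
        out.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(out)
        extracted.append(name)
    return extracted
```

Calling extract_archives("downloads", "qm_descriptors") would, under these assumptions, leave the contents in qm_descriptors/data, qm_descriptors/splits, and so on, for whichever archives are present.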

Below are descriptions of the available scripts:

  1. atom_bond_descriptors.sh: Trains a model on atom/bond targets.
  2. atom_bond_descriptors_predict.sh: Predicts atom/bond targets from a pre-trained model.
  3. dipole_quadrupole_moments.sh: Trains a model on dipole and quadrupole moments.
  4. dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from a pre-trained model.
  5. energy_gaps_IP_EA.sh: Trains a model on energy gaps, ionization potential (IP), and electron affinity (EA).
  6. energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from a pre-trained model.
  7. get_constraints.py: Generates a constraints file for a test dataset. This file must be provided before using our trained models to predict the atom/bond QM descriptors of your test data.
  8. csv2pkl.py: Applies RBF expansion to QM atom and bond features and saves them as .pkl files for use with the Chemprop software.
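To illustrate what an RBF expansion of a scalar descriptor looks like, here is a minimal sketch of the general technique. It is not the code in csv2pkl.py: the function name, the evenly spaced centers, and the gamma width parameter are all assumptions, and the values the authors used are not stated on this page.

```python
import math


def rbf_expand(value, start, stop, n_centers, gamma):
    """Expand a scalar into n_centers Gaussian radial basis features.

    Feature k is exp(-gamma * (value - c_k)**2), where the centers c_k
    are evenly spaced from start to stop inclusive. (Assumed scheme.)
    """
    if n_centers < 2:
        raise ValueError("n_centers must be at least 2")
    step = (stop - start) / (n_centers - 1)
    centers = [start + k * step for k in range(n_centers)]
    return [math.exp(-gamma * (value - c) ** 2) for c in centers]
```

For example, rbf_expand(0.5, 0.0, 1.0, 3, 10.0) places centers at 0.0, 0.5, and 1.0, so the middle feature is 1.0 and the two outer features are both exp(-2.5). The expansion turns one number into a smooth vector that peaks at the center closest to the value.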

Below is the procedure for running the ml-QM-GNN on your own dataset:

  1. Use get_constraints.py to generate the constraints file required for predicting atom/bond QM descriptors with the trained ML models.
  2. Execute atom_bond_descriptors_predict.sh to predict atom and bond properties, then run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to predict molecular QM descriptors.
  3. Use csv2pkl.py to convert the predicted atom/bond descriptors in the .csv file into separate atom and bond feature files, saved as .pkl files.
  4. Run Chemprop to train your models with the predicted descriptors as additional features.
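The order of steps 1 to 3 can be sketched as a list of commands. This is only an outline under stated assumptions: the helper name is hypothetical, .sh files are assumed to run with bash and .py files with python, and no arguments are passed because the inputs each script expects are not documented on this page. Step 4 is left out because the Chemprop command depends on your own data and setup.

```python
from pathlib import Path

# Script names from the list above, in the order the procedure runs them.
# The interpreter for each script is an assumption.
PIPELINE = [
    ("python", "get_constraints.py"),
    ("bash", "atom_bond_descriptors_predict.sh"),
    ("bash", "dipole_quadrupole_moments_predict.sh"),
    ("bash", "energy_gaps_IP_EA_predict.sh"),
    ("python", "csv2pkl.py"),
]


def build_commands(scripts_dir):
    """Return one command (as an argument list) per step, in order.

    Raises FileNotFoundError if a script is missing from scripts_dir,
    so problems surface before anything is run. No script arguments
    are added; check each script for the inputs it needs.
    """
    scripts_dir = Path(scripts_dir)
    commands = []
    for interpreter, script in PIPELINE:
        path = scripts_dir / script
        if not path.is_file():
            raise FileNotFoundError(f"missing script: {path}")
        commands.append([interpreter, str(path)])
    return commands
```

Each returned command could then be handed to subprocess.run(cmd, check=True) once the paths and settings inside the scripts have been adapted to your dataset.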

Files

Files (3.5 GB)

Checksum                                Size
md5:abeae4be4440fb67daaccd9ffa18d559    1.9 GB
md5:cb07284e8c06181f5154a291dfaebc49    1.6 GB
md5:fbb5197a843ad05455a3ef38d171c819    3.2 kB
md5:d77fc36388a10710737c4499428ee873    293.1 kB
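A downloaded file can be checked against its listed MD5 checksum with a short helper such as the sketch below. The function names are hypothetical. Because this listing does not show which checksum belongs to which file name, the expected value has to be supplied by hand.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a listed checksum.

    `listed` may carry the "md5:" prefix used in the file list above.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For instance, matches_listed_checksum(path, "md5:abeae4be4440fb67daaccd9ffa18d559") returns True only if the file at path is the 1.9 GB file listed with that checksum and was downloaded intact.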

Additional details

Funding

Defense Advanced Research Projects Agency
Accelerated Molecular Discovery (AMD) program HR00111920025
National Science and Technology Council
Young Scholar Fellowship Einstein Program 112-2636-E-002-005