Published November 9, 2023 | Version v2
Dataset Open

Benchmark Data for Chemprop

  • 1. Massachusetts Institute of Technology, TU Wien
  • 2. Massachusetts Institute of Technology
  • 3. Massachusetts Institute of Technology, National Taiwan University
  • 4. Harvard University, Massachusetts Institute of Technology
  • 5. Massachusetts Institute of Technology, KU Leuven
  • 6. Massachusetts Institute of Technology, Virginia Commonwealth University

Description

Datasets and splits for the manuscript "Chemprop: Machine Learning Package for Chemical Property Prediction." Train, validation, and test splits are located within each folder, along with additional data required by some of the benchmarks. To train Chemprop models, refer to our code repository, which provides ready-to-use scripts for training machine learning models on each of the systems.
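As a rough way to inspect the per-folder layout described above, the following sketch lists the split files found in each benchmark folder of an extracted copy of the archive. The file naming is an assumption: this record does not state the actual file names, so the keywords `train`, `val`, and `test` and the helper `find_split_files` are hypothetical and may need adjusting.

```python
from collections import defaultdict
from pathlib import Path

# Assumed naming convention (hypothetical): split files contain one of these
# keywords in their file name, e.g. "train.csv", "val.csv", "test.csv".
SPLIT_KEYWORDS = ("train", "val", "test")


def find_split_files(root):
    """Map each benchmark folder under `root` to its files, grouped by split.

    Returns {folder_name: {"train": [...], "val": [...], "test": [...]}}.
    Files matching no keyword (for example additional data files) are
    grouped under "other".
    """
    layout = {}
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        groups = defaultdict(list)
        for path in sorted(f for f in folder.iterdir() if f.is_file()):
            name = path.name.lower()
            # First matching keyword wins; unmatched files go to "other".
            key = next((k for k in SPLIT_KEYWORDS if k in name), "other")
            groups[key].append(path.name)
        layout[folder.name] = dict(groups)
    return layout
```

Calling `find_split_files("path/to/extracted/archive")` returns a plain dictionary, which makes it easy to spot folders that carry extra files beyond the three splits.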

Available benchmarking systems:

  • `hiv` HIV replication inhibition from MoleculeNet and OGB with scaffold splits
  • `pcba_random` Biological activities from MoleculeNet with random splits (missing targets filled in with zeros, as provided by MoleculeNet)
  • `pcba_random_nans` Biological activities from MoleculeNet with random splits, in a data format matching OGB (missing targets not filled in with zeros)
  • `pcba_scaffold` Biological activities from OGB with scaffold splits
  • `qm9_multitask` DFT-calculated properties from MoleculeNet and OGB, trained as a multi-task model
  • `qm9_u0` DFT-calculated properties from MoleculeNet and OGB, trained as a single-task model on the target U0 only
  • `qm9_gap` DFT-calculated properties from MoleculeNet and OGB, trained as a single-task model on the target gap only
  • `sampl` Water-octanol partition coefficients, used to make predictions for molecules from the SAMPL6, SAMPL7, and SAMPL9 challenges
  • `atom_bond_137k` Quantum-mechanical atom and bond descriptors
  • `bde` Bond dissociation enthalpies, trained as a single-task model
  • `bde_charges` Bond dissociation enthalpies, trained as a multi-task model together with atomic partial charges
  • `charges_eps_4` Partial charges at a dielectric constant of 4 (in protein)
  • `charges_eps_78` Partial charges at a dielectric constant of 78 (in water)
  • `barriers_e2` Reaction barrier heights of E2 reactions
  • `barriers_sn2` Reaction barrier heights of SN2 reactions
  • `barriers_cycloadd` Reaction barrier heights of cycloaddition reactions
  • `barriers_rdb7` Reaction barrier heights in the RDB7 dataset
  • `barriers_rgd1` Reaction barrier heights in the RGD1-CNHO dataset
  • `multi_molecule` UV/Vis peak absorption wavelengths in different solvents
  • `ir` IR spectra
  • `pcqm4mv2` HOMO-LUMO gaps of the PCQM4Mv2 dataset
  • `uncertainty_ensemble` Uncertainty estimation with an ensemble on the QM9 gap dataset
  • `uncertainty_evidential` Uncertainty estimation with evidential learning on the QM9 gap dataset
  • `uncertainty_mve` Uncertainty estimation with mean-variance estimation on the QM9 gap dataset
  • `timing` Timing benchmark using subsets of QM9 gap
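The difference between `pcba_random` (missing targets filled in with zeros) and `pcba_random_nans` (missing targets left unfilled) affects how label statistics and losses are computed. The sketch below uses a made-up toy label table, not data from the benchmarks, to show how zero-filling counts unmeasured entries as inactive, whereas masking excludes them. The function `active_fraction` is a hypothetical helper for illustration only.

```python
import math

# Toy multi-task labels (made up): rows are molecules, columns are assays.
# NaN marks an assay that was not measured for that molecule.
labels = [
    [1.0, math.nan],
    [0.0, math.nan],
    [math.nan, 1.0],
    [1.0, 0.0],
]


def active_fraction(column, fill_missing_with_zero):
    """Fraction of active (1.0) labels in one assay column."""
    if fill_missing_with_zero:
        # Zero-filling: unmeasured entries are counted as inactive.
        values = [0.0 if math.isnan(v) else v for v in column]
    else:
        # Masking: unmeasured entries are dropped before averaging.
        values = [v for v in column if not math.isnan(v)]
    return sum(values) / len(values)


columns = list(zip(*labels))
zero_filled = [active_fraction(c, True) for c in columns]
masked = [active_fraction(c, False) for c in columns]

print("zero-filled:", zero_filled)
print("masked:     ", masked)
```

In this toy table the apparent fraction of actives is lower in every column after zero-filling, because each unmeasured entry is added to the inactive count.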

Version: This version of the dataset (Version 2) is compatible with all versions of Chemprop that support the respective functionality. Version 1 of this dataset is compatible with all versions except Chemprop v1.6.1, which cannot process the `charges_eps_4` and `charges_eps_78` datasets (all other benchmarks work as expected). We therefore recommend always using Version 2 of the dataset, in which the `charges_eps_4` and `charges_eps_78` datasets have been reformatted. For use with any other ML software, either version can be used.

Files

Files (1.6 GB)

md5:57fef6993e9539ed8751a630c337e841 (1.6 GB)
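The MD5 checksum listed above can be used to check that a download completed without corruption. A minimal sketch, assuming the archive has been saved locally; the path passed to `verify_download` is whatever name the file was saved under, which this listing does not state.

```python
import hashlib

# Checksum as given in the file listing of this record.
EXPECTED_MD5 = "57fef6993e9539ed8751a630c337e841"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()
```

Reading in 1 MiB chunks keeps memory use flat for the 1.6 GB file. A `False` result suggests an incomplete or corrupted download, or a file from a different version of the record.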

Additional details

Funding

  • FWF Austrian Science Fund: Computer-aided design of multi-enzyme networks (J 4415)
  • National Science Foundation: Graduate Research Fellowship Program (GRFP) (1745302)
  • KU Leuven: KU Leuven Internal Starting Grant (STG/22/032)