There is a newer version of the record available.

Published July 22, 2023 | Version 1.0
Dataset Open

Benchmark Data for Chemprop

  • 1. Massachusetts Institute of Technology, TU Wien
  • 2. Massachusetts Institute of Technology
  • 3. Massachusetts Institute of Technology, National Taiwan University
  • 4. Harvard University, Massachusetts Institute of Technology
  • 5. Massachusetts Institute of Technology, KU Leuven
  • 6. Massachusetts Institute of Technology, Virginia Commonwealth University

Description

Datasets and splits for the manuscript "Chemprop: Machine Learning Package for Chemical Property Prediction." Each benchmark folder contains the train, validation and test splits, along with any additional data that some of the benchmarks require. To train Chemprop models, refer to our code repository, which provides ready-to-use scripts for training machine learning models on each of the systems.

Available benchmarking systems:

  •  `hiv` HIV replication inhibition from MoleculeNet and OGB with scaffold splits
  •  `pcba_random` Biological activities from MoleculeNet and OGB with random splits
  •  `pcba_scaffold` Biological activities from MoleculeNet and OGB with scaffold splits
  •  `qm9_multitask` DFT-calculated properties from MoleculeNet and OGB, trained as a multi-task model
  •  `qm9_u0` DFT-calculated properties from MoleculeNet and OGB, trained as a single-task model on the target U0 only
  •  `qm9_gap` DFT-calculated properties from MoleculeNet and OGB, trained as a single-task model on the target gap only
  •  `sampl` Water-octanol partition coefficients, used to make predictions for molecules from the SAMPL6, SAMPL7 and SAMPL9 challenges
  •  `atom_bond_137k` Quantum-mechanical atom and bond descriptors
  •  `bde` Bond dissociation enthalpies, trained as a single-task model
  •  `bde_charges` Bond dissociation enthalpies, trained as a multi-task model together with atomic partial charges
  •  `charges_eps_4` Partial charges at a dielectric constant of 4 (in protein)
  •  `charges_eps_78` Partial charges at a dielectric constant of 78 (in water)
  •  `barriers_e2` Reaction barrier heights of E2 reactions
  •  `barriers_sn2` Reaction barrier heights of SN2 reactions
  •  `barriers_cycloadd` Reaction barrier heights of cycloaddition reactions
  •  `barriers_rdb7` Reaction barrier heights in the RDB7 dataset
  •  `barriers_rgd1` Reaction barrier heights in the RGD1-CNHO dataset
  •  `multi_molecule` UV/Vis peak absorption wavelengths in different solvents
  •  `ir` IR spectra
  •  `pcqm4mv2` HOMO-LUMO gaps of the PCQM4Mv2 dataset
  •  `uncertainty_ensemble` Uncertainty estimation with an ensemble on the QM9 gap dataset
  •  `uncertainty_evidential` Uncertainty estimation with evidential learning on the QM9 gap dataset
  •  `uncertainty_mve` Uncertainty estimation with mean-variance estimation on the QM9 gap dataset
  •  `timing` Timing benchmark using subsets of QM9 gap
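
After extracting the archive, it can be useful to confirm that every benchmarking system listed above is present. The following is a minimal sketch, not part of the record itself. It assumes that each system is stored in a folder named exactly after its identifier directly under the extraction directory; that layout is an assumption and should be checked against the actual archive. The function name `check_benchmark_folders` is hypothetical.

```python
from pathlib import Path

# Identifiers copied from the list of benchmarking systems in this record.
BENCHMARKS = [
    "hiv", "pcba_random", "pcba_scaffold",
    "qm9_multitask", "qm9_u0", "qm9_gap",
    "sampl", "atom_bond_137k", "bde", "bde_charges",
    "charges_eps_4", "charges_eps_78",
    "barriers_e2", "barriers_sn2", "barriers_cycloadd",
    "barriers_rdb7", "barriers_rgd1",
    "multi_molecule", "ir", "pcqm4mv2",
    "uncertainty_ensemble", "uncertainty_evidential", "uncertainty_mve",
    "timing",
]


def check_benchmark_folders(root):
    """Split BENCHMARKS into (present, missing) relative to `root`.

    Assumption: each system lives in a folder named after its identifier,
    directly under `root`. Adjust if the archive is organized differently.
    """
    root = Path(root)
    present = [name for name in BENCHMARKS if (root / name).is_dir()]
    missing = [name for name in BENCHMARKS if name not in present]
    return present, missing
```

Calling `check_benchmark_folders("path/to/extracted/data")` returns two lists, so a non-empty second list points to systems that were not found under the assumed layout.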

Files

Files (1.6 GB)

Checksum: md5:55375620453241f9f60b275536160556
Size: 1.6 GB
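
Because the download is large, it is worth checking it against the MD5 checksum listed above before extracting it. The sketch below uses only the Python standard library and reads the file in chunks so the 1.6 GB archive is never loaded into memory at once. The helper names `md5_of_file` and `verify_download` are hypothetical, and the file name is not given in this record, so the path must be supplied by the user.

```python
import hashlib

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "55375620453241f9f60b275536160556"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("path/to/downloaded/archive")` returns `False` for a truncated or corrupted download. MD5 is used here only because it is the checksum the record provides; it detects transfer errors but is not a safeguard against deliberate tampering.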

Additional details

Funding

  • FWF Austrian Science Fund: Computer-aided design of multi-enzyme networks, grant J 4415
  • National Science Foundation: Graduate Research Fellowship Program (GRFP), grant 1745302