Published July 22, 2023 | Version 1.0
Dataset
Open
Benchmark Data for Chemprop
Creators
- 1. Massachusetts Institute of Technology, TU Wien
- 2. Massachusetts Institute of Technology
- 3. Massachusetts Institute of Technology, National Taiwan University
- 4. Harvard University, Massachusetts Institute of Technology
- 5. Massachusetts Institute of Technology, KU Leuven
- 6. Massachusetts Institute of Technology, Virginia Commonwealth University
Description
Datasets and splits for the manuscript "Chemprop: Machine Learning Package for Chemical Property Prediction." Each folder contains the train, validation and test splits, along with any additional data required by that benchmark. To train Chemprop models, see our code repository, which provides ready-to-use training scripts for each of the systems.
Available benchmarking systems:
- `hiv` HIV replication inhibition from MoleculeNet and OGB with scaffold splits
- `pcba_random` Biological activities from MoleculeNet and OGB with random splits
- `pcba_scaffold` Biological activities from MoleculeNet and OGB with scaffold splits
- `qm9_multitask` DFT-calculated properties from MoleculeNet and OGB, trained as a multi-task model
- `qm9_u0` DFT-calculated properties from MoleculeNet and OGB, trained as a single-task model on the target U0 only
- `qm9_gap` DFT-calculated properties from MoleculeNet and OGB, trained as a single-task model on the target gap only
- `sampl` Water-octanol partition coefficients, used to predict values for molecules from the SAMPL6, SAMPL7 and SAMPL9 challenges
- `atom_bond_137k` Quantum-mechanical atom and bond descriptors
- `bde` Bond dissociation enthalpies, trained as a single-task model
- `bde_charges` Bond dissociation enthalpies, trained as a multi-task model together with atomic partial charges
- `charges_eps_4` Partial charges at a dielectric constant of 4 (in protein)
- `charges_eps_78` Partial charges at a dielectric constant of 78 (in water)
- `barriers_e2` Reaction barrier heights of E2 reactions
- `barriers_sn2` Reaction barrier heights of SN2 reactions
- `barriers_cycloadd` Reaction barrier heights of cycloaddition reactions
- `barriers_rdb7` Reaction barrier heights in the RDB7 dataset
- `barriers_rgd1` Reaction barrier heights in the RGD1-CNHO dataset
- `multi_molecule` UV/Vis peak absorption wavelengths in different solvents
- `ir` IR spectra
- `pcqm4mv2` HOMO-LUMO gaps of the PCQM4Mv2 dataset
- `uncertainty_ensemble` Uncertainty estimation with an ensemble on the QM9 gap dataset
- `uncertainty_evidential` Uncertainty estimation with evidential learning on the QM9 gap dataset
- `uncertainty_mve` Uncertainty estimation with mean-variance estimation on the QM9 gap dataset
- `timing` Timing benchmark using subsets of QM9 gap
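Since each benchmark folder holds its own train, validation and test splits, a quick inventory of a folder can be useful before training. The following is a minimal sketch, not part of the Chemprop code base. It assumes the archive has been extracted locally and that the splits are stored as CSV files with a header row directly inside the benchmark folder; the function name `summarize_benchmark` is hypothetical, and the actual file names and layout should be checked against the extracted data.

```python
import csv
from pathlib import Path


def summarize_benchmark(folder):
    """Map each CSV file name in `folder` to (header, number of data rows).

    Assumption: split files are CSVs with one header row, stored directly
    in the benchmark folder. Adjust the glob pattern if the layout differs.
    """
    summary = {}
    for csv_path in sorted(Path(folder).glob("*.csv")):
        with open(csv_path, newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])          # first row is treated as the header
            n_rows = sum(1 for _ in reader)    # remaining rows are data points
        summary[csv_path.name] = (header, n_rows)
    return summary
```

Calling `summarize_benchmark` on the path of one extracted benchmark folder (for example the `hiv` folder, wherever it was extracted) returns a dictionary keyed by file name, which makes it easy to confirm that all expected splits are present and to compare their sizes.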
Files
(1.6 GB)
Name | Size |
---|---|
md5:55375620453241f9f60b275536160556 | 1.6 GB |
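The MD5 checksum listed above can be used to confirm that the 1.6 GB download arrived intact. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_download` are illustrative, and the path of the downloaded file has to be supplied by the user because the file name is not shown in this record.

```python
import hashlib

# Checksum as listed in the Files table of this record.
EXPECTED_MD5 = "55375620453241f9f60b275536160556"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the 1.6 GB file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

If `verify_download` returns `False` for the downloaded file, the download is likely incomplete or corrupted and should be repeated.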
Additional details
Funding
- FWF Austrian Science Fund: Computer-aided design of multi-enzyme networks, J 4415
- National Science Foundation: Graduate Research Fellowship Program (GRFP), 1745302