Molecules used to train generative models
Description
This directory contains the sets of molecules used to train chemical language models in the paper "Learning generative models of molecules from limited training examples."
Between 1,000 and 500,000 molecules were sampled from each of four chemical databases (ChEMBL, COCONUT, GDB, and ZINC). These molecules were represented in one of three formats: SMILES, DeepSMILES, or SELFIES. For molecules in the SMILES format, data augmentation was also performed by enumerating non-canonical SMILES, with augmentation factors of 3x, 10x, or 30x. For each training dataset size, ten independent samples were drawn to assess variability.
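The replicate sampling described above might look roughly like the following. This is a minimal sketch, not the authors' code: it assumes sampling without replacement and one seed per replicate, and the function `draw_replicates` and the toy SMILES list are hypothetical names introduced only for illustration.

```python
import random


def draw_replicates(molecules, sample_size, n_replicates=10, base_seed=0):
    """Draw independent random samples, one per replicate.

    Assumption (not stated in the description): each replicate is drawn
    without replacement and uses its own seed so the draw is reproducible.
    """
    if sample_size > len(molecules):
        raise ValueError("sample_size exceeds the number of molecules")
    replicates = []
    for i in range(n_replicates):
        rng = random.Random(base_seed + i)  # separate generator per replicate
        replicates.append(rng.sample(molecules, sample_size))
    return replicates


# Toy stand-in for a database of SMILES strings (hypothetical data).
database = ["C" + "C" * i + "O" for i in range(100)]
samples = draw_replicates(database, sample_size=20)
```

Using a separate seeded generator per replicate keeps the ten draws independent of one another while still letting any single replicate be regenerated on its own.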
Separately, the Diversity.zip archive contains samples of between 1,000 and 10,000 molecules used to evaluate the impact of chemical diversity on model performance. These training sets were constructed by randomly sampling one molecule as a 'founder', then filtering the remaining molecules in the database based on their Tanimoto coefficient to the 'founder' molecule. Twenty independent samples were drawn from each of the GDB, ChEMBL, and ZINC databases for Tanimoto coefficients in the range from 0 to 0.15 (GDB) or 0.2 (ChEMBL, ZINC).
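As an illustration of the founder-based construction, the sketch below picks a random founder and keeps only molecules whose Tanimoto coefficient to it passes a cutoff. Several things here are assumptions rather than details from the description: fingerprints are given as precomputed sets of on-bits, the filter is treated as a minimum similarity, the founder is excluded from the sample, and the names `tanimoto` and `sample_around_founder` are hypothetical.

```python
import random


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def sample_around_founder(fingerprints, min_tc, sample_size, seed=0):
    """Pick a random founder, filter by similarity to it, then sample.

    Assumptions: the filter is a minimum Tanimoto coefficient to the founder,
    and the founder itself is not included in the returned sample.
    """
    rng = random.Random(seed)
    ids = sorted(fingerprints)
    founder = rng.choice(ids)
    pool = [
        mol_id
        for mol_id in ids
        if mol_id != founder
        and tanimoto(fingerprints[mol_id], fingerprints[founder]) >= min_tc
    ]
    if len(pool) < sample_size:
        raise ValueError("not enough molecules pass the similarity filter")
    return founder, rng.sample(pool, sample_size)


# Toy fingerprints (hypothetical): neighbouring molecules share most bits.
toy_fps = {f"mol{i}": set(range(i, i + 8)) for i in range(50)}
founder, selected = sample_around_founder(toy_fps, min_tc=0.2, sample_size=3)
```

In practice the fingerprints would come from a cheminformatics toolkit; plain Python sets are used here only to keep the example self-contained.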
Files
Total size: 20.1 GB

| MD5 checksum | Size |
|---|---|
| 03621f805ac0a83669dbc8e64cc33607 | 5.5 GB |
| 6aebf50d7b1aa1e909a8f41ba45c8a9e | 6.7 GB |
| 85e8b6fbb75cd6264258f7d25ea16205 | 148.3 MB |
| af323623e2232cb161c34fa08481d491 | 3.1 GB |
| a48f3d34259b3a05200f66cc813e8dfd | 1.2 kB |
| e796d1342d25312a79c9c65967ddfe1b | 4.7 GB |
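The MD5 checksums listed above can be used to confirm that a multi-gigabyte download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file is read in chunks so that large archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value.

    Accepts the value with or without a leading "md5:" prefix.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("downloaded_archive.zip", "md5:03621f805ac0a83669dbc8e64cc33607")` would return `True` only if the local file matches that listed checksum; the file name here is a placeholder, not one taken from the record.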