There is a newer version of the record available.

Published January 5, 2021 | Version v1
Dataset Open

Molecules used to train generative models

  • 1. University of British Columbia

Description

This directory contains sets of molecules used to train chemical language models in the paper, "Learning generative models of molecules from limited training examples."

Between 1,000 and 500,000 molecules were sampled from each of four chemical databases (ChEMBL, COCONUT, GDB, and ZINC). These molecules were represented in one of three formats: SMILES, DeepSMILES, or SELFIES. For molecules in the SMILES format, data augmentation was also performed by enumerating non-canonical SMILES, with augmentation factors of 3x, 10x, or 30x. For each training dataset size, ten independent samples were drawn to assess variability.
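The repeated-sampling design described above can be illustrated with a minimal sketch. This is not the authors' code: the function name `draw_samples`, the seeding scheme, and the toy molecule list are assumptions made for illustration, and the sketch assumes molecules are drawn without replacement within each sample, which the description does not state.

```python
import random


def draw_samples(molecules, size, n_samples=10, seed=0):
    """Draw n_samples independent random subsets of `size` molecules each.

    Assumption: sampling is without replacement within one subset, and
    each subset gets its own seeded generator so results are reproducible.
    """
    samples = []
    for i in range(n_samples):
        rng = random.Random(seed + i)  # hypothetical seeding scheme
        samples.append(rng.sample(molecules, size))
    return samples


# Toy stand-in for a database of SMILES strings (illustrative only).
molecules = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C1CCCCC1", "CO"]
samples = draw_samples(molecules, size=3)
print(len(samples), "samples of", len(samples[0]), "molecules each")
```

In the real datasets, `size` would range from 1,000 to 500,000. Comparing models trained on the ten subsets of a given size gives an estimate of the variability that comes from the choice of training molecules.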

Files

ChEMBL.zip

Files (20.0 GB)

Checksum                              Size
md5:03621f805ac0a83669dbc8e64cc33607  5.5 GB
md5:6aebf50d7b1aa1e909a8f41ba45c8a9e  6.7 GB
md5:af323623e2232cb161c34fa08481d491  3.1 GB
md5:871763fbf4d1ce3f26e04cae39ac0d7d  626 Bytes
md5:e796d1342d25312a79c9c65967ddfe1b  4.7 GB
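Because the archives are several gigabytes each, it can be worth checking a download against the md5 checksums listed above. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the listing does not say which checksum belongs to which file, so that pairing has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (the path is a placeholder for wherever the archive was saved;
# substitute the checksum shown next to that file on the record page):
# ok = matches_checksum("path/to/archive.zip", "md5:<hex digest>")
```

A mismatch usually means the download was truncated or corrupted and should be repeated before the archive is unpacked.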