Molecules used to train generative models

Skinnider, Michael

doi:10.5281/zenodo.4641960

Published January 5, 2021 | Version v2

Dataset Open

Skinnider, Michael¹

1. University of British Columbia

This directory contains sets of molecules used to train chemical language models in the paper, "Learning generative models of molecules from limited training examples."

Between 1,000 and 500,000 molecules were sampled from each of four chemical databases (ChEMBL, COCONUT, GDB, and ZINC). These molecules were represented in one of three formats: SMILES, DeepSMILES, or SELFIES. For molecules in the SMILES format, data augmentation was also performed by enumerating non-canonical SMILES, with augmentation factors of 3x, 10x, or 30x. For each training dataset size, ten independent samples were drawn to assess variability.
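The replicate sampling described above can be illustrated with a minimal sketch. The description does not state how the samples were actually drawn, so the function name `draw_replicates`, the seeding scheme, and the toy database below are all assumptions made for illustration only.

```python
import random

def draw_replicates(molecules, sample_size, n_replicates=10, seed=0):
    """Draw independent random samples, each without replacement.

    Hypothetical helper: not taken from the dataset's own code.
    """
    if sample_size > len(molecules):
        raise ValueError("sample_size exceeds the number of available molecules")
    samples = []
    for replicate in range(n_replicates):
        # A separate, reproducible generator per replicate (an assumed scheme).
        rng = random.Random(seed + replicate)
        samples.append(rng.sample(molecules, sample_size))
    return samples

# Toy stand-in for a database of SMILES strings (not real dataset contents).
database = ["C" * n + "O" for n in range(1, 51)]
replicates = draw_replicates(database, sample_size=5)
```

Each entry of `replicates` is one training set; in the dataset itself the sizes range from 1,000 to 500,000 molecules rather than the toy size used here.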

Separately, the Diversity.zip archive contains samples of between 1,000 and 10,000 molecules used to evaluate the impact of chemical diversity on model performance. These training sets were constructed by randomly sampling one molecule as a 'founder', then filtering the remaining molecules in the database based on their Tanimoto coefficient to the 'founder' molecule. Twenty independent samples were drawn from each of the GDB, ChEMBL, and ZINC databases for Tanimoto coefficients in the range from 0 to 0.15 (GDB) or 0.2 (ChEMBL, ZINC).
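A rough sketch of the founder-based filtering follows. Two things here are assumptions and not details from the description: the fingerprint is a toy set of character bigrams (a real workflow would use a chemical fingerprint from a cheminformatics toolkit), and molecules are kept when their Tanimoto coefficient to the founder is at least the threshold. The names `toy_fingerprint`, `founder_sample`, and `min_tc` are hypothetical.

```python
import random

def toy_fingerprint(smiles):
    """Set of character bigrams: a toy stand-in, NOT a real chemical fingerprint."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}

def tanimoto(a, b):
    """Tanimoto coefficient between two feature sets: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def founder_sample(molecules, min_tc, sample_size, seed=0):
    """Pick a random founder, then sample molecules similar enough to it."""
    rng = random.Random(seed)
    founder = rng.choice(molecules)
    founder_fp = toy_fingerprint(founder)
    # ASSUMPTION: keep molecules whose coefficient to the founder is >= min_tc.
    candidates = [
        m for m in molecules
        if m != founder and tanimoto(toy_fingerprint(m), founder_fp) >= min_tc
    ]
    rng.shuffle(candidates)
    return founder, candidates[:sample_size]

# Toy molecules for illustration only.
toy_molecules = ["CCO", "CCN", "CCCO", "c1ccccc1", "CC(=O)O", "CCOC"]
founder, sample = founder_sample(toy_molecules, min_tc=0.2, sample_size=3)
```

With a threshold of 0 every other molecule in the database is eligible, so the filter reduces to plain random sampling.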

Files (20.1 GB)

Name	MD5	Size
ChEMBL.zip	03621f805ac0a83669dbc8e64cc33607	5.5 GB
COCONUT.zip	6aebf50d7b1aa1e909a8f41ba45c8a9e	6.7 GB
Diversity.zip	85e8b6fbb75cd6264258f7d25ea16205	148.3 MB
GDB.zip	af323623e2232cb161c34fa08481d491	3.1 GB
README	a48f3d34259b3a05200f66cc813e8dfd	1.2 kB
ZINC.zip	e796d1342d25312a79c9c65967ddfe1b	4.7 GB
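Since the archives are several gigabytes each, it can be worth checking a download against the MD5 checksum listed with each file. The sketch below uses only Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local file path in the usage comment is an assumption.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so multi-gigabyte archives fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the listed checksum."""
    return md5_of_file(path) == expected_md5.lower()

# Usage, assuming Diversity.zip was saved to the working directory:
# verify_download("Diversity.zip", "85e8b6fbb75cd6264258f7d25ea16205")
```

A `False` result usually indicates an incomplete or corrupted download, in which case the file should be fetched again.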

