CREMP: Conformer-rotamer ensembles of macrocyclic peptides for machine learning

doi:10.5281/zenodo.8010582

Published May 13, 2023 | Version 1.0.1

Dataset Open

1. Prescient Design, Genentech
2. Department of Peptide Therapeutics, Genentech
3. Biology Research | AI Development, Genentech

CREMP is a resource generated for the rapid development and evaluation of machine learning models for macrocyclic peptides. It contains 36,198 unique macrocyclic peptides and their high-quality structural ensembles generated using the Conformer-Rotamer Ensemble Sampling Tool (CREST). Altogether, the dataset contains nearly 31.3 million unique macrocycle geometries, each annotated with energies derived from semi-empirical tight-binding DFT calculations. We anticipate that this dataset will enable the development of machine learning models that can improve peptide design and optimization for novel therapeutics.

We provide the data in two formats: Python pickle files, which offer quick read access with RDKit version 2022.09.5 or later, and text-based SDF files with associated metadata in JSON format. Each file is named after its amino acid sequence, with residues separated by periods, using standard one-letter codes in which lowercase letters represent D-amino acids and "Me" prefixes represent N-methylated amino acids. Because the peptides are cyclic, the starting residue of a sequence is arbitrary, e.g., "C.R.E.M.P" and "R.E.M.P.C" correspond to the same peptide macrocycle. The filename extensions are ".pickle", ".sdf", and ".json".
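The naming scheme above can be illustrated with a minimal sketch. It assumes that the "Me" prefix is attached directly to the one-letter code (for example "MeA"), which is an assumption about the exact token layout. The helper names and the example name "MeA.v.L.pickle" are made up for illustration, and the "canonical rotation" is only a convention chosen here for comparing cyclic sequences, not something the dataset defines.

```python
KNOWN_EXTENSIONS = (".pickle", ".sdf", ".json")


def parse_residues(filename):
    """Split a CREMP-style file name into per-residue annotations."""
    name = filename
    # Strip only the known extension; the sequence itself contains periods.
    for ext in KNOWN_EXTENSIONS:
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    residues = []
    for token in name.split("."):
        # Assumption: N-methylation is written as "Me" + one-letter code.
        n_methylated = token.startswith("Me") and len(token) > 2
        code = token[2:] if n_methylated else token
        residues.append(
            {
                "code": code.upper(),
                "is_d": code.islower(),  # lowercase letter means D-amino acid
                "n_methylated": n_methylated,
            }
        )
    return residues


def canonical_rotation(tokens):
    """Return the lexicographically smallest rotation of a cyclic sequence."""
    rotations = [tokens[i:] + tokens[:i] for i in range(len(tokens))]
    return min(rotations)


# Both names from the description map to the same canonical rotation.
same = canonical_rotation("C.R.E.M.P".split(".")) == canonical_rotation(
    "R.E.M.P.C".split(".")
)
```

Comparing canonical rotations is one way to check whether two sequence strings describe the same macrocycle before looking up a file.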

Each file in the “pickle” folder contains a Python dictionary with the amino acid sequence, SMILES, CREST metadata, and a single RDKit molecule object containing all conformers. All files in the folder were compressed into a single “pickle.tar.gz” archive. In the “sdf_and_json” folder, each SDF file contains all conformers of one peptide and is paired with its own JSON file containing the CREST metadata. These files are likewise compressed into a single archive, “sdf_and_json.tar.bz2”. A single summary CSV file is also provided, with the columns “sequence”, “smiles”, “num_monomers”, “num_atoms”, and “num_heavy_atoms”, along with the CREST metadata “totalconfs”, “uniqueconfs”, “lowestenergy”, “poplowestpct”, “temperature”, “ensembleenergy”, “ensembleentropy”, and “ensemblefreeenergy”. The number of unique conformers with different 3D structures is given by “uniqueconfs”, while “totalconfs” additionally counts rotamers.
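As a rough sketch of how the summary CSV might be used, the snippet below reads the “totalconfs” and “uniqueconfs” columns with the standard library and reports their difference per sequence. It assumes both columns hold integers and that the difference can be read as the number of additional rotamer structures, which is an interpretation of the description rather than a documented definition. The function name and the two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io


def extra_rotamers(csv_file):
    """Map each sequence to totalconfs - uniqueconfs from a summary-style CSV."""
    counts = {}
    for row in csv.DictReader(csv_file):
        total = int(row["totalconfs"])    # assumed integer column
        unique = int(row["uniqueconfs"])  # assumed integer column
        counts[row["sequence"]] = total - unique
    return counts


# Made-up rows using only three of the documented column names.
sample = io.StringIO(
    "sequence,totalconfs,uniqueconfs\n"
    "A.b.C.D,120,100\n"
    "MeA.v.L.G,50,50\n"
)
result = extra_rotamers(sample)

# With the real file, one would pass an open handle instead:
# with open("summary.csv", newline="") as handle:
#     result = extra_rotamers(handle)
```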

The unzipped sizes of the archives are approximately 32 GB for "pickle.tar.gz" and 210 GB for "sdf_and_json.tar.bz2". If you encounter errors when trying to load the pickle files, please make sure your RDKit version is at least 2022.09.5. If that doesn't work, try other Python versions.
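Given the unpacked size, it may be convenient to inspect a few entries without extracting the whole archive. The following is a minimal sketch using only the standard library; the function name and the `limit` parameter are hypothetical, and it assumes the pickle files sit inside the archive with a ".pickle" extension as described above. Unpickling the real entries requires RDKit to be importable, and pickle files should only be loaded from sources you trust.

```python
import pickle
import tarfile


def iter_pickled_entries(archive_path, limit=None):
    """Yield (member_name, object) pairs straight from a .tar.gz archive."""
    count = 0
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not a pickle file.
            if not (member.isfile() and member.name.endswith(".pickle")):
                continue
            handle = archive.extractfile(member)
            yield member.name, pickle.load(handle)
            count += 1
            if limit is not None and count >= limit:
                return


# Example usage (not run here): look at the dictionary keys of the first entry.
# for name, entry in iter_pickled_entries("pickle.tar.gz", limit=1):
#     print(name, sorted(entry))
```

Printing the keys first avoids guessing the exact dictionary key names, which the description does not spell out.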

Notes

This version fixes the corrupted pickle.tar.gz file.

Files

Files (49.5 GB)

Name	MD5	Size
pickle.tar.gz	925d058e9d96942e5aca55b12480efc3	28.9 GB
sdf_and_json.tar.bz2	e010ecc8f5a54d886695a46b175a3a7a	20.5 GB
summary.csv	6941dd885a38abefbc651bd0299c1719	6.0 MB
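Since the archives are large, it can be worth checking a download against the MD5 values listed above. This is a minimal sketch with a hypothetical helper name; it reads the file in chunks so that a multi-gigabyte archive does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example usage (not run here), with the checksum listed for summary.csv:
# assert md5_of_file("summary.csv") == "6941dd885a38abefbc651bd0299c1719"
```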

Additional details

Is cited by: Preprint: 10.48550/arXiv.2305.19800 (DOI)
Is published in: Journal article: 10.1038/s41597-024-03698-y (DOI)

Repository URL: https://github.com/Genentech/cremp
Programming language: Python

	All versions	This version
Views	1,605	661
Downloads	1,395	685
Data volume	25.8 TB	9.6 TB
