Published March 17, 2025 | Version 1.1
Dataset Open

Pretraining data for PeptideCLM (UPDATED)

  • 1. The University of Texas at Austin

Contributors

Researcher:

  • 1. The University of Texas at Austin

Description

This version updates Generated_peptides.csv to fix cyclization. In the prior upload, ring closures were not generated correctly as SMILES strings. The model in the publication was trained on the dataset containing these errors; however, to support the community, we decided it would be best to release a corrected 10M peptide SMILES dataset for use in future pretraining applications. All strings should now convert correctly to mol files with RDKit.
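The claim that all strings now load with RDKit can be spot-checked on a sample of rows. The following is a minimal sketch, not the authors' validation procedure. It assumes the CSV has a header row and that the SMILES column is named `SMILES`; both are assumptions, since the record does not describe the file layout. The function names are hypothetical. `Chem.MolFromSmiles` returns `None` for a string RDKit cannot parse, which is the failure signal used here.

```python
import csv
from itertools import islice


def rdkit_parses(smiles):
    """Return True if RDKit can build a molecule from the SMILES string."""
    # Imported lazily so this module can be loaded without RDKit installed.
    from rdkit import Chem
    return Chem.MolFromSmiles(smiles) is not None


def find_unparsable(path, column="SMILES", limit=1000, parses=rdkit_parses):
    """Return (row_number, smiles) pairs among the first `limit` rows that fail to parse.

    `column` is an assumed header name; adjust it to match the actual file.
    """
    failures = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # islice streams the file, so the full 10M-row CSV is never held in memory.
        for row_number, row in enumerate(islice(reader, limit), start=1):
            smiles = row[column]
            if not parses(smiles):
                failures.append((row_number, smiles))
    return failures
```

Calling `find_unparsable("Generated_peptides.csv")` on the corrected file would be expected to return an empty list if the description holds for the sampled rows. Raising `limit` checks more of the file at the cost of runtime.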

Files

Generated_peptides.csv

Files (14.0 GB)

Checksum Size
md5:f891628037f968145a7fdc0b8b099f8c 10.8 GB
md5:e3e045b4a2c18a84d1134f261063f031 763.4 MB
md5:c2b81725a458a9b38e49f6e72bc110cd 455.7 MB
md5:d683dc67487320dddb3a105faa2da2f0 743.0 MB
md5:c595e8175d42c65d94801791962f712a 1.2 GB
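Because the largest file is over 10 GB, it may be worth verifying a download against its listed MD5 checksum before use. The sketch below shows one way to do that with Python's standard `hashlib`, reading the file in chunks so it never has to fit in memory. The function names and the local path in the usage note are hypothetical, and the record as shown here does not say which checksum belongs to which file name.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with a listed value, with or without the 'md5:' prefix."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("downloads/some_file.csv", "md5:f891628037f968145a7fdc0b8b099f8c")` returns `True` only if the local file (a hypothetical path here) is byte-identical to the file that carries that checksum in the list above.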

Additional details

Dates

Available
2024-11-20