CPSea
Description
We highly recommend using the new version of CPSea, which we will release soon. The current version has some issues with hydrogen geometry and chirality.
Hi, this is CPSea, a large-scale cyclic peptide-protein complex database. Containing 2.64M complex structures, CPSea enables training cyclic peptide design models from scratch for the first time. We hope everyone finds something useful in CPSea.
This site and the Kaggle site host the main body of CPSea, along with list files and metadata for the different subsets. Please check our GitHub for the scripts used to generate and evaluate CPSea.
Dataset
Below is the structure of the dataset:
.
├── CPBind/
│   ├── CPBind_index.txt
│   ├── CPBind_pdb/
│   │   ├── CPBind_pdb.1
│   │   ├── CPBind_pdb.2
│   │   ├── CPBind_pdb.3
│   │   └── CPBind_pdb.4
│   └── CPBind_Properties/
│       ├── CPBind_affinity.csv
│       ├── CPBind_basic.tsv
│       ├── CPBind_cluster.tsv
│       ├── CPBind_hydrophobic.csv
│       └── CPBind_validity.csv
├── CPBind_unique/
│   ├── CPBind_unique_index.txt
│   ├── CPBind_unique_pdb
│   └── CPBind_unique_Properties/
│       ├── CPBind_unique_affinity.csv
│       ├── CPBind_unique_basic.tsv
│       ├── CPBind_unique_hydrophobic.csv
│       └── CPBind_unique_validity.csv
├── CPSea_AFDB/...
├── CPSea_AFDB_unique/...
├── CPTrans/...
├── CPTrans_unique/...
├── CPCore/...
├── CPCore_unique/...
├── CPSea_PDB/...
├── CPSea_PDB_unique/...
└── Weight/
    ├── DiffPepBuilder_Cyc.pth
    ├── PepFlow_Cyc.pt
    └── PepGlad_Cyc.ckpt
We provide 10 datasets: CPSea derived from AFDB and from PDB, 3 subsets of CPSea_AFDB, and a _unique counterpart of each. The 5 base datasets are clustered with Foldseek, and one complex is selected from every cluster to create the corresponding _unique dataset. Basic information on the provided datasets is listed below:
Subset | Scale (complexes) | Data Source | Description |
---|---|---|---|
CPSea_AFDB | 2,636,249 | AFDB | main set generated from AFDB |
CPSea_AFDB_unique | 574,878 | AFDB | CPSea_AFDB after removing redundancy |
CPBind | 476,590 | AFDB | Rosetta ddG < -25 and Vina score < -6 |
CPBind_unique | 101,182 | AFDB | CPBind after removing redundancy |
CPTrans | 527,777 | AFDB | logP > -6 and GRAVY < 0 |
CPTrans_unique | 157,088 | AFDB | CPTrans after removing redundancy |
CPCore | 51,820 | AFDB | essentially the intersection of CPBind and CPTrans |
CPCore_unique | 22,881 | AFDB | CPCore after removing redundancy |
CPSea_PDB | 11,511 | PDB | main set generated from PDB |
CPSea_PDB_unique | 5,482 | PDB | CPSea_PDB after removing redundancy |
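The subset criteria in the table can be read as simple threshold predicates. The sketch below is illustrative only: the function and parameter names are hypothetical, the thresholds are copied from the table, and it assumes CPCore is the plain intersection of the CPBind and CPTrans criteria (the table only says it is roughly so), so the released subsets may differ in edge cases.

```python
def in_cpbind(rosetta_ddg: float, vina_score: float) -> bool:
    """CPBind criterion from the table: Rosetta ddG < -25 and Vina score < -6."""
    return rosetta_ddg < -25 and vina_score < -6


def in_cptrans(logp: float, gravy: float) -> bool:
    """CPTrans criterion from the table: logP > -6 and GRAVY < 0."""
    return logp > -6 and gravy < 0


def in_cpcore(rosetta_ddg: float, vina_score: float, logp: float, gravy: float) -> bool:
    """Assumption: CPCore is treated here as the exact intersection of both criteria."""
    return in_cpbind(rosetta_ddg, vina_score) and in_cptrans(logp, gravy)
```

The values passed in would come from the affinity and hydrophobic property files; the column names in those files are not documented here and need to be checked against the actual headers.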
In each dataset, we provide an index file, structure tar file(s), and property files.
- The index file lists file names that correspond to the files in the structure tar file. Names follow the format `{original_structure_id}_{first_residue}_{last_residue}_relaxed`, so the original structure from which each cyclic peptide is derived can be read from the file name.
- Structure tar file(s) are identified by the _pdb suffix (e.g., CPBind_unique_pdb). The .tar extension is removed, and large tar files are chunked into several files for efficient processing in Kaggle (e.g., CPSea_AFDB_pdb.1, CPSea_AFDB_pdb.2, etc.). Although these files have no extension, they are tar files, and you can run `tar -xvf <structure_tar_file>` to extract the PDB files.
- For the property files, we provide 5 kinds of csv/tsv files, as described below:
File name | Description |
---|---|
basic | Basic information corresponding to the filter metrics used during initial cyclic peptide identification |
affinity | Affinity evaluation based on Rosetta and Vina |
validity | Ramachandran plot statistics and the number of each type of interaction at the interface |
hydrophobic | Metrics on hydrophobicity and membrane permeability |
cluster | Multimer clustering results based on Foldseek |
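The three kinds of files described above can be handled with the Python standard library alone. The following is a minimal sketch, not the authors' tooling: the function names are hypothetical, it assumes each chunk is an independent tar archive (as the `tar -xvf` instruction suggests), and it keeps residue fields as strings because their exact form is not specified here.

```python
import csv
import tarfile
from pathlib import Path
from typing import NamedTuple


class EntryName(NamedTuple):
    structure_id: str
    first_residue: str
    last_residue: str


def parse_entry_name(name: str) -> EntryName:
    """Split '{original_structure_id}_{first_residue}_{last_residue}_relaxed'."""
    stem = name.strip()
    if stem.endswith(".pdb"):  # assumption: index entries may or may not end in .pdb
        stem = stem[:-4]
    # Split from the right so underscores inside the structure ID are preserved.
    parts = stem.rsplit("_", 3)
    if len(parts) != 4 or parts[3] != "relaxed":
        raise ValueError(f"unexpected entry name: {name!r}")
    return EntryName(parts[0], parts[1], parts[2])


def extract_chunks(chunks: list[Path], out_dir: Path) -> None:
    """Extract each chunk as its own tar archive into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for chunk in chunks:
        # tarfile detects the archive format itself, so no extension is needed.
        with tarfile.open(chunk) as archive:
            archive.extractall(out_dir)


def read_property_table(path: Path) -> list[dict[str, str]]:
    """Read a csv or tsv property file into a list of row dictionaries."""
    delimiter = "\t" if path.suffix == ".tsv" else ","
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

For example, `parse_entry_name("AF-P12345-F1-model_v4_10_25_relaxed")` (a made-up name) yields the structure ID `AF-P12345-F1-model_v4` with residues `10` and `25`. Splitting from the right is a deliberate choice, since structure IDs may themselves contain underscores.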
Weights
We provide checkpoint files for the three models re-trained on CPSea in Weights, namely DiffPepBuilder_Cyc.pth, PepFlow_Cyc.pt and PepGlad_Cyc.ckpt.
Please check out our GitHub for instructions on how to generate and evaluate cyclic peptides using these re-trained weights.
Files
Total size: 159.2 GB
MD5 | Size |
---|---|
858c231d5c0e2e0a66ca8061de220cc0 | 19.3 GB |
bbe7ec3505d0bc88d3a45ca4fcad63c5 | 3.9 GB |
cf27d687fa16aed2ee0cb8a3d981626e | 2.0 GB |
58bb922418990a13d1cd215a32b02386 | 856.9 MB |
e00a25af340480b0c67f68998bfe874d | 89.4 GB |
086b437e8480a75aa008ca0d130ec9c0 | 18.9 GB |
74cc7fae04cc660b504f1fe253658f72 | 493.6 MB |
e3cf09ad589365ae00166bfcf4811524 | 228.6 MB |
122cee193fd135cd2751721ee89fd76e | 17.9 GB |
1fd710fdfe05a88600b86cf88ba0bffc | 5.2 GB |
e30f06e3387d95e2a692bd10db90a3f7 | 1.2 GB |
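Because the downloads are large, it may be worth checking each file against its listed MD5 checksum before extracting it. The sketch below uses only `hashlib`; the function names are hypothetical, and the file is read in chunks so that multi-gigabyte archives do not need to fit in memory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_md5(path: Path, expected: str) -> bool:
    """Compare a file's MD5 with a listed checksum; an 'md5:' prefix is ignored."""
    wanted = expected.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == wanted
```

A call such as `matches_md5(Path("<downloaded_file>"), "md5:858c231d5c0e2e0a66ca8061de220cc0")` would then return `True` only if the download is intact, where `<downloaded_file>` stands for whichever file that checksum belongs to.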
Additional details
Software
- Repository URL: https://github.com/YZY010418/CPSea
- Programming language: Python