Published August 15, 2025 | Version v2
Other Open

CPSea

  • 1. Tsinghua University

Description

🚧 We highly recommend using the new version of CPSea, which we will release soon! The current version has known issues with hydrogen geometry and chirality.

😊 Hi, this is CPSea, a large-scale database of cyclic peptide-protein complexes. With 2.64M complex structures, CPSea makes it possible, for the first time, to train cyclic peptide design models from scratch. We hope that everyone can find something useful in CPSea 🌊

🗺️ This site and the Kaggle site store the main body of CPSea, as well as some list files and metadata for the different subsets. Please check our GitHub for the scripts used to generate and evaluate CPSea.

Dataset

💡 Below is the structure of the dataset:

.
├── CPBind/
│   ├── CPBind_index.txt
│   ├── CPBind_pdb/
│   │   ├── CPBind_pdb.1
│   │   ├── CPBind_pdb.2
│   │   ├── CPBind_pdb.3
│   │   └── CPBind_pdb.4
│   └── CPBind_Properties/
│       ├── CPBind_affinity.csv
│       ├── CPBind_basic.tsv
│       ├── CPBind_cluster.tsv
│       ├── CPBind_hydrophobic.csv
│       └── CPBind_validity.csv
├── CPBind_unique/
│   ├── CPBind_unique_index.txt
│   ├── CPBind_unique_pdb
│   └── CPBind_unique_Properties/
│       ├── CPBind_unique_affinity.csv
│       ├── CPBind_unique_basic.tsv
│       ├── CPBind_unique_hydrophobic.csv
│       └── CPBind_unique_validity.csv
├── CPSea_AFDB/...
├── CPSea_AFDB_unique/...
├── CPTrans/...
├── CPTrans_unique/...
├── CPCore/...
├── CPCore_unique/...
├── CPSea_PDB/...
├── CPSea_PDB_unique/...
└── Weight/
    ├── DiffPepBuilder_Cyc.pth
    ├── PepFlow_Cyc.pt
    └── PepGlad_Cyc.ckpt

We provide 10 datasets: CPSea derived from AFDB and from PDB (CPSea_AFDB and CPSea_PDB), and 3 subsets of CPSea_AFDB (CPBind, CPTrans, and CPCore). Each of these 5 datasets is clustered with Foldseek, and one complex is selected from every cluster to create the corresponding _unique dataset. Basic information on the provided datasets is listed below:

Subset             Scale      Data Source  Description
CPSea_AFDB         2,636,249  AFDB         Main set generated from AFDB
CPSea_AFDB_unique  574,878    AFDB         CPSea_AFDB after removing redundancy
CPBind             476,590    AFDB         Rosetta ddG < -25 and Vina score < -6
CPBind_unique      101,182    AFDB         CPBind after removing redundancy
CPTrans            527,777    AFDB         logP > -6 and GRAVY < 0
CPTrans_unique     157,088    AFDB         CPTrans after removing redundancy
CPCore             51,820     AFDB         Essentially the intersection of CPBind and CPTrans
CPCore_unique      22,881     AFDB         CPCore after removing redundancy
CPSea_PDB          11,511     PDB          Main set generated from PDB
CPSea_PDB_unique   5,482      PDB          CPSea_PDB after removing redundancy
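
The subset rules in the table can be read as simple threshold checks, and the _unique sets as a one-per-cluster pick. The sketch below is only an illustration of that reading. The function names, the two-column (cluster id, member) TSV layout, the choice of the first member seen in each cluster, and the demo ids are all assumptions; the released cluster files and the authors' scripts may differ.

```python
import csv
import io


def passes_cpbind(rosetta_ddg, vina_score):
    """CPBind rule as listed in the table: Rosetta ddG < -25 and Vina score < -6."""
    return rosetta_ddg < -25 and vina_score < -6


def passes_cptrans(logp, gravy):
    """CPTrans rule as listed in the table: logP > -6 and GRAVY < 0."""
    return logp > -6 and gravy < 0


def read_cluster_pairs(handle):
    """Read (cluster_id, member) pairs from a tab-separated stream.

    Assumption: two columns per row, cluster id first. The real
    *_cluster.tsv files may have a header or a different layout.
    """
    reader = csv.reader(handle, delimiter="\t")
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def one_per_cluster(pairs):
    """Keep a single member per cluster (here: the first one encountered)."""
    chosen = {}
    for cluster_id, member in pairs:
        chosen.setdefault(cluster_id, member)
    return list(chosen.values())


# Hypothetical demo data, not taken from the dataset.
demo_tsv = "c1\tpepA\nc1\tpepB\nc2\tpepC\n"
pairs = read_cluster_pairs(io.StringIO(demo_tsv))
print(one_per_cluster(pairs))  # ['pepA', 'pepC']
```

Which complex is actually kept from each cluster is not stated here, so picking the first member is purely a placeholder.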

 

In each dataset, we provide an index file, structure tar file(s), and property files:

  • The index file lists file names that correspond to the files in the structure tar file(s). Each name has the format {original_structure_id}_{first_residue}_{last_residue}_relaxed, so the original structure from which each cyclic peptide is derived can be read from the file name.
  • Structure tar file(s) are identified by the _pdb suffix (e.g., CPBind_unique_pdb). The .tar extension is removed, and large tar files are chunked into several files for efficient processing in Kaggle (e.g., CPSea_AFDB_pdb.1, CPSea_AFDB_pdb.2, etc.). Although they have no file extension, these files are tar files, and you can run tar -xvf <structure_tar_file> to extract the pdb files.
  • For the property files, we provide 5 kinds of csv/tsv files, as described below:
File name    Description
basic        Basic information corresponding to the filter metrics used during initial cyclic peptide identification
affinity     Affinity evaluation based on Rosetta and Vina
validity     The Ramachandran plot and the number of different types of interactions in interfaces
hydrophobic  Metrics on hydrophobicity and membrane permeability
cluster      Multimer clustering results based on Foldseek
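
As a rough illustration of the naming scheme and the extraction step described above, the sketch below parses an index entry and unpacks structure files with Python's standard tarfile module. It assumes that residue positions are plain integers, that each chunk (e.g. CPSea_AFDB_pdb.1) is a standalone tar archive, and that entries may carry an optional .pdb extension; the helper names and the example id are hypothetical. If the chunks turn out to be byte-split pieces of one archive, they would need to be concatenated first.

```python
import tarfile
from pathlib import Path
from typing import NamedTuple


class EntryName(NamedTuple):
    structure_id: str
    first_residue: int
    last_residue: int


def parse_entry_name(name: str) -> EntryName:
    """Split {original_structure_id}_{first_residue}_{last_residue}_relaxed.

    Splitting from the right keeps underscores inside the structure id intact.
    """
    stem = name[:-4] if name.endswith(".pdb") else name
    structure_id, first, last, tag = stem.rsplit("_", 3)
    if tag != "relaxed":
        raise ValueError(f"unexpected suffix in {name!r}")
    return EntryName(structure_id, int(first), int(last))


def extract_chunks(chunk_paths, out_dir):
    """Extract every tar chunk into out_dir (assumes each chunk is a full tar)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    for chunk in chunk_paths:
        with tarfile.open(chunk) as archive:
            archive.extractall(out, **kwargs)


# Hypothetical entry name, used only to show the parsing.
print(parse_entry_name("AF-P12345-F1-model_v4_12_25_relaxed"))
```

A call such as extract_chunks(["CPBind_pdb.1", "CPBind_pdb.2"], "CPBind_structures") would then mirror running tar -xvf on each file in turn.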

Weights

We provide checkpoint files for the three models re-trained on CPSea in Weights, namely DiffPepBuilder_Cyc.pth, PepFlow_Cyc.pt, and PepGlad_Cyc.ckpt.

Please check out our GitHub for instructions on how to generate and evaluate cyclic peptides using these re-trained weights.

Files

CPBind.zip

Files (159.2 GB)

MD5 checksum                          Size
md5:858c231d5c0e2e0a66ca8061de220cc0  19.3 GB
md5:bbe7ec3505d0bc88d3a45ca4fcad63c5  3.9 GB
md5:cf27d687fa16aed2ee0cb8a3d981626e  2.0 GB
md5:58bb922418990a13d1cd215a32b02386  856.9 MB
md5:e00a25af340480b0c67f68998bfe874d  89.4 GB
md5:086b437e8480a75aa008ca0d130ec9c0  18.9 GB
md5:74cc7fae04cc660b504f1fe253658f72  493.6 MB
md5:e3cf09ad589365ae00166bfcf4811524  228.6 MB
md5:122cee193fd135cd2751721ee89fd76e  17.9 GB
md5:1fd710fdfe05a88600b86cf88ba0bffc  5.2 GB
md5:e30f06e3387d95e2a692bd10db90a3f7  1.2 GB
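
Because the archives are large, it can be worth checking a download against the md5 value listed for it. The following is a minimal sketch using Python's standard hashlib module; the helper names are hypothetical, and the file path is whatever local path the archive was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed value such as 'md5:858c23...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, matches_listed_md5("CPBind.zip", "md5:<value from the list>") returns True only when the local file's digest equals the listed checksum.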

Additional details

Software

Repository URL
https://github.com/YZY010418/CPSea
Programming language
Python