There is a newer version of the record available.

Published March 12, 2025 | Version v1
Software Open

PepScorer::RMSD: an improved Machine Learning scoring function for protein-peptide docking

Description

1. Background:

PepScorer::RMSD is a machine learning-based scoring function (SF) specifically tailored to the pose-selection task for short peptides. The need for such an SF arose from the strong interest in peptides as therapeutic entities in recent years and from the unsatisfactory performance of protein-peptide docking, which is due in large part to non-specific scoring functions.

2. Methods:

PepScorer::RMSD consists of a regression machine learning (ML) model that predicts the root-mean-square deviation (RMSD) between a given pose and the corresponding native one. To develop PepScorer::RMSD, we collected and curated a high-quality dataset of 298 protein-peptide complexes, including peptides between 3 and 10 amino acids long. For each complex, we generated a set of binding poses that, together with the X-ray pose, were used to train and evaluate the model.
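As a minimal sketch of the quantity the model is trained to predict, the RMSD between a pose and its native counterpart can be computed from matched atomic coordinates as shown below. The function name `pose_rmsd` is hypothetical, and the assumptions that both inputs list the same atoms in the same order and are compared without superposition are illustrative; they are not taken from the PepScorer::RMSD code.

```python
import math


def pose_rmsd(pose_xyz, native_xyz):
    """Return the RMSD between two lists of (x, y, z) coordinates.

    Assumption: both lists contain the same atoms in the same order and
    are compared in place (no superposition), so the result is in the
    same length unit as the inputs, e.g. Å.
    """
    if len(pose_xyz) != len(native_xyz) or len(pose_xyz) == 0:
        raise ValueError("expected two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(pose_xyz, native_xyz):
        # Accumulate the squared distance between matched atoms.
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(pose_xyz))


# Toy coordinates (invented): every atom of the pose is shifted by 1 Å along x.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
pose = [(x + 1.0, y, z) for x, y, z in native]
shifted_rmsd = pose_rmsd(pose, native)
```

Because every atom is displaced by the same distance in this toy case, the RMSD equals that displacement.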

3. Results:

PepScorer::RMSD outperformed commonly used, ML-based, and peptide-specific scoring functions, achieving a Pearson correlation coefficient R of 0.75, a mean absolute error (MAE) of 1.69 Å, and a top-1 DP of 96% on the single evaluation set and 81% on the curated external test set.
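The metrics above can be illustrated with a short sketch. The Pearson R and MAE follow their standard definitions; the top-1 measure is written here under the assumption that it is the fraction of complexes whose best-scored pose (lowest predicted RMSD) has a true RMSD within a threshold. The 2.0 Å default, the function names, and the toy numbers are all illustrative assumptions and may differ from the evaluation protocol actually used for PepScorer::RMSD.

```python
import math


def pearson_r(pred, true):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


def mean_absolute_error(pred, true):
    """Mean absolute difference between predicted and true values."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def top1_success_rate(records, threshold=2.0):
    """Fraction of complexes whose top-ranked pose is near-native.

    records: iterable of (complex_id, predicted_rmsd, true_rmsd).
    threshold: assumed near-native cutoff on the true RMSD (hypothetical default).
    """
    best = {}
    for complex_id, pred, true in records:
        # Keep the pose with the lowest predicted RMSD for each complex.
        if complex_id not in best or pred < best[complex_id][0]:
            best[complex_id] = (pred, true)
    hits = sum(1 for _, true in best.values() if true <= threshold)
    return hits / len(best)


# Toy data (invented values, not results from the dataset).
toy_records = [
    ("cplx_a", 0.5, 1.0),
    ("cplx_a", 3.0, 0.2),
    ("cplx_b", 1.0, 5.0),
    ("cplx_b", 2.0, 1.0),
]
toy_rate = top1_success_rate(toy_records)
```

In the toy data, the top-ranked pose of `cplx_a` is near-native while that of `cplx_b` is not, so ranking by predicted RMSD succeeds for one of the two complexes.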

4. File description:

The X-ray structures underwent energy minimization, with the protein backbone and the peptide treated as rigid and the protein side chains as flexible. The resulting structures were used as reference structures, and the corresponding pose was called “pepxray”. Starting from these, we generated two further structures by energy minimization, keeping the protein backbone fixed and the protein side chains free to move, with the peptide either free or subject to a constraint of 0.5. These two poses were called “freelig” and “05lig”, respectively. The remaining poses were obtained by molecular docking with PLANTS or ADCP. The best 23 poses in terms of RMSD were selected for model development.

CSV files:

1) PepScorerRMSD_proteins.csv: list of all the complexes included in the dataset, identified by their PDB ID and annotated with the peptide and protein chain identifiers, the peptide length, and the structural group to which the complex belongs.

2) PepScorerRMSD_poses.csv: list of all pose filenames, with the corresponding PDB IDs and pose RMSD values.
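A pose table of this kind could be grouped by complex roughly as follows. This is a sketch only: the column names `filename`, `pdb_id`, and `rmsd`, the helper `poses_by_complex`, and the toy rows are all hypothetical, so the actual header of PepScorerRMSD_poses.csv should be checked before adapting it.

```python
import csv
import io
from collections import defaultdict


def poses_by_complex(csv_file, id_col="pdb_id", name_col="filename", rmsd_col="rmsd"):
    """Group pose rows by complex and sort each group by ascending RMSD.

    The default column names are assumptions, not the real CSV header.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row[id_col]].append((row[name_col], float(row[rmsd_col])))
    return {
        complex_id: sorted(poses, key=lambda pose: pose[1])
        for complex_id, poses in grouped.items()
    }


# Toy input standing in for the real file (identifiers and values invented).
example = io.StringIO(
    "filename,pdb_id,rmsd\n"
    "demo1_poseA.pdb,demo1,2.5\n"
    "demo1_poseB.pdb,demo1,0.8\n"
    "demo2_poseA.pdb,demo2,4.1\n"
)
grouped = poses_by_complex(example)
```

With the real file, the same function could be given a handle opened with `open("PepScorerRMSD_poses.csv", newline="")`, passing the actual column names as keyword arguments.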

Directories:

1) Proteins:

·  Reference: reference protein structures, obtained after energy minimization with flexible side chains and a fixed peptide ligand.

·  Minimization_05: protein structures obtained after energy minimization with flexible side chains and a partially flexible peptide ligand (constraint of 0.5).

·  Minimization_free: protein structures obtained after energy minimization with flexible side chains and a flexible peptide ligand.

2) Poses: the 23 poses for each protein.

3) PepScorerRMSD: 

· objects: directory where files for running the model are stored.

· test: test files.

· predict.py: Python script for running the model.

· README.pdf: instructions for running the model.

· requirements.txt: required Python libraries.

Files

PepScorerRMSD.zip

Files (168.2 MB)

Checksum (Size)
md5:1814ea3710415bc029c6a4d4953d9fa1 (13.2 MB)
md5:21553432250e12e91aa9afe0d38a2212 (340.3 kB)
md5:add2a66ca6e1142b61e544b740c2042e (4.8 kB)
md5:619842a0d7c70524827f85b52afa3b05 (25.5 MB)
md5:8ae69a202861d477249b184e1898c17d (129.1 MB)

Additional details

Software

Repository URL: https://github.com/andregiuseppecavalli/PepScorerRMSD
Programming language: Python
Development Status: Active