HOPPred – Prediction of peptide hormones using an ensemble of machine learning and similarity‑based methods
Description
Title:
HOPPred Dataset – Experimentally validated peptide hormones and non‑hormonal peptides
Description:
Project: HOPPred – Prediction of peptide hormones using an ensemble of machine learning and similarity‑based methods
Publication: Kaur, D., Arora, A., Vigneshwar, P., & Raghava, G.P.S. (2024). Prediction of peptide hormones using an ensemble of machine learning and similarity‑based methods. Proteomics, 24, e2400004. https://doi.org/10.1002/pmic.202400004
Overview: This dataset accompanies HOPPred, the first computational tool for predicting peptide hormones. Peptide hormones are genome‑encoded signal transduction molecules essential for regulating growth, development, and homeostasis; their dysregulation leads to endocrine disorders (e.g., diabetes, neoplasia). The dataset is curated from Hmrbase2 and other sources, is balanced (1,174 hormonal + 1,174 non‑hormonal peptides), and is redundancy‑reduced with CD‑HIT at 90% sequence identity.
Content:
| Dataset | Peptides |
|---|---|
| Hormonal (positive) | 1,174 |
| Non‑hormonal (negative) | 1,174 |
| Total | 2,348 |
Key Findings – Compositional analysis:
- Hormonal peptides enriched in: Cysteine (C), Aspartic acid (D), Phenylalanine (F), Glycine (G), Arginine (R), Serine (S), Asparagine (N), Proline (P), Tyrosine (Y) – statistically significant (Mann‑Whitney U, p < 0.05)
- Non‑hormonal peptides enriched in: Glutamic acid (E), Isoleucine (I), Leucine (L), Methionine (M), Glutamine (Q), Lysine (K), Threonine (T), Valine (V)
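A compositional comparison of this kind can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' pipeline: it computes the percent amino acid composition of each peptide and the Mann‑Whitney U statistic for one residue across two groups. The example sequences are invented for illustration and are not taken from the dataset. For an actual p-value, a library routine such as `scipy.stats.mannwhitneyu` would be the usual choice.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Percent composition of each of the 20 standard residues in one peptide."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: 100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS}

def mann_whitney_u(xs, ys):
    """U statistic for xs versus ys: each pair with x > y counts 1, each tie 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

# Toy sequences for illustration only (assumption: not real dataset entries).
hormonal = ["CFGPRLC", "GSCCFR", "YGGFC"]
non_hormonal = ["MKELLV", "EIKTLQ", "LVEMKT"]

cys_pos = [aac(s)["C"] for s in hormonal]
cys_neg = [aac(s)["C"] for s in non_hormonal]
u = mann_whitney_u(cys_pos, cys_neg)
print(f"Cysteine: U = {u} out of {len(cys_pos) * len(cys_neg)} pairs")
```

A U value close to the total number of pairs means the residue is consistently more frequent in the first group; a value near half the pairs means no separation.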
Exclusive motifs in hormonal peptides (MERCI): FGPR, WFGP, WFGPR, FGPRL, GPRL, MWFGPRL, LCGS (LCGS is a known motif in insulin chain B)
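Checking a peptide against the motifs listed above is a plain substring search. The sketch below only looks up the published motifs; it does not reproduce MERCI, which discovers motifs from positive and negative sets. The function name `find_motifs` and the query sequence are hypothetical.

```python
# Motifs copied from the list above; everything else is illustrative.
HORMONE_MOTIFS = ["FGPR", "WFGP", "WFGPR", "FGPRL", "GPRL", "MWFGPRL", "LCGS"]

def find_motifs(seq, motifs=HORMONE_MOTIFS):
    """Return (motif, 0-based start) for every occurrence of each motif in seq."""
    seq = seq.upper()
    hits = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            hits.append((motif, start))
            # Advance by one so overlapping occurrences are also reported.
            start = seq.find(motif, start + 1)
    return hits

hits = find_motifs("AAMWFGPRLKK")  # made-up query sequence
for motif, start in hits:
    print(f"{motif} at position {start + 1}")
```

Because several motifs are substrings of one another, a single region can produce multiple hits; whether to count them once or several times is a design decision left to the caller.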
Best Model Performance (validation set – 20% held out):
| Model | AUC | MCC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|
| Ensemble (LR + Motif + BLAST) | 0.96 | 0.80 | 89.8% | 90.1% | 89.5% |
| LR (ML alone – top 50 features) | 0.93 | 0.72 | 86.0% | 85.3% | 86.6% |
| TextCNN (DL) | 0.90 | 0.67 | 83.0% | 87.0% | 79.0% |
| RF (ML – top 50 features) | 0.90 | 0.64 | 82.1% | 80.2% | 84.0% |
| TabNet (DL) | 0.75 | 0.57 | 74.0% | 73.0% | 75.0% |
Top features align with motifs: DPC1_CF (Cys‑Phe), TPC_FRP (Phe‑Arg‑Pro), TPC_GNF (Gly‑Asn‑Phe), TPC_LMG (Leu‑Met‑Gly), TPC_RGL (Arg‑Gly‑Leu) – overlapping with motifs FGPR, WFGPRL, etc., confirming biological relevance.
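The feature names above suggest dipeptide (DPC) and tripeptide (TPC) composition features. As a rough sketch, assuming these are percent frequencies of overlapping adjacent-residue k-mers (the exact definition and naming used by the authors' feature-extraction tool is not stated here), they could be computed as follows. The function `kmer_composition` and the example peptide are hypothetical.

```python
from collections import Counter

def kmer_composition(seq, k):
    """Percent frequency of each overlapping k-mer (k=2 dipeptide, k=3 tripeptide)."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: 100.0 * c / total for kmer, c in counts.items()}

peptide = "WFGPRLCF"  # made-up peptide for illustration
dpc = kmer_composition(peptide, 2)
tpc = kmer_composition(peptide, 3)

# Look up features by the k-mer named in the feature label, e.g. "CF" for DPC1_CF.
print("CF :", round(dpc.get("CF", 0.0), 2))
print("GPR:", round(tpc.get("GPR", 0.0), 2))
print("FRP:", round(tpc.get("FRP", 0.0), 2))
```

Only k-mers that occur are stored; absent ones are treated as zero at lookup time, which keeps the representation sparse (the full tripeptide space has 8,000 entries).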
Data Curation & Quality Control:
- Source: Hmrbase2 (hormone database) + PeptideAtlas + UniProt/Swiss‑Prot
- Redundancy reduction: CD‑HIT at 90% sequence identity
- Negative set: Randomly selected from Swiss‑Prot excluding known hormones
- Train/validation split: 80/20 (5‑fold CV on training)
- Feature selection: RFE (Recursive Feature Elimination) with Logistic Regression as estimator
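The 80/20 split with 5‑fold cross-validation on the training portion can be sketched with the standard library alone. This is a minimal illustration under assumptions: the per-class (stratified) split, the seeds, and the helper names `stratified_split` and `k_folds` are choices made here, not details confirmed by the record. Feature selection with RFE is not shown.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/validation, holding out the same fraction per class."""
    rng = random.Random(seed)
    train, valid = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * test_fraction)
        valid.extend(idx[:cut])
        train.extend(idx[cut:])
    return sorted(train), sorted(valid)

def k_folds(indices, k=5, seed=0):
    """Yield (fit, held_out) index lists for plain (non-stratified) k-fold CV."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]

# Labels mirror the dataset sizes: 1,174 positives and 1,174 negatives.
labels = [1] * 1174 + [0] * 1174
train_idx, valid_idx = stratified_split(labels)
splits = list(k_folds(train_idx))
print(len(train_idx), len(valid_idx), [len(held) for _, held in splits])
```

The validation indices are set aside before any fold is formed, so cross-validation never touches the held-out 20%.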
Usage: Predicting peptide hormones from sequence, designing novel hormone peptides (Design module), scanning protein sequences for hormone regions (Protein Scan module), identifying hormone‑associated motifs, developing peptide‑based therapeutics and endocrine disorder treatments.
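Scanning a protein for candidate hormone regions typically means scoring overlapping fragments. The sketch below shows only the windowing idea; it is not the Protein Scan module itself. The window length, the protein sequence, and `toy_score` (which simply counts residues listed above as enriched in hormonal peptides) are all hypothetical stand-ins for a real predictor.

```python
def sliding_windows(protein, window=10, step=1):
    """Yield (1-based start, fragment) for each overlapping window of the protein."""
    protein = protein.upper()
    for start in range(0, len(protein) - window + 1, step):
        yield start + 1, protein[start:start + window]

def toy_score(fragment):
    """Placeholder scorer: fraction of residues from the hormone-enriched set."""
    enriched = set("CDFGRSNPY")
    return sum(residue in enriched for residue in fragment) / len(fragment)

protein = "MKTLLVEIQCFGPRSNYGDKELMV"  # made-up sequence for illustration
scored = [
    (pos, frag, toy_score(frag))
    for pos, frag in sliding_windows(protein, window=8)
]
best = max(scored, key=lambda item: item[2])
print(f"best window starts at {best[0]}: {best[1]} (score {best[2]:.2f})")
```

In practice `toy_score` would be replaced by a call to a trained model, and adjacent high-scoring windows would usually be merged into a single reported region.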
Related Resources: Web server: https://webs.iiitd.edu.in/raghava/hoppred/ | GitHub: https://github.com/raghavagps/HOPPRED
Contact: raghava@iiitd.ac.in (Gajendra P. S. Raghava)
Files
| Name | Size | MD5 |
|---|---|---|
| raghavagps/hoppred-v1.0.zip | 167.9 kB | 04f3c0fd3f96250c820216afdbdcd176 |
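The MD5 checksum listed for the archive can be used to confirm that a download is intact. A minimal sketch follows; the local path `hoppred-v1.0.zip` is an assumption about where the file was saved.

```python
import hashlib

# Checksum as published with the archive.
EXPECTED_MD5 = "04f3c0fd3f96250c820216afdbdcd176"

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected.lower()

# Example call (assumed local filename):
# verify_download("hoppred-v1.0.zip")
```

MD5 is adequate for detecting accidental corruption during transfer, but it should not be relied on as protection against deliberate tampering.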
Additional details
Related works
- Is supplement to: Software: https://github.com/raghavagps/hoppred/tree/v1.0 (URL)
Software
- Repository URL: https://github.com/raghavagps/hoppred