Published April 30, 2026 | Version v1.0
Software Open

HOPPred – Prediction of peptide hormones using an ensemble of machine learning and similarity‑based methods

  • 1. Indraprastha Institute of Information Technology Delhi

Description

Title:
HOPPred Dataset – Experimentally validated peptide hormones and non‑hormonal peptides

Project: HOPPred – Prediction of peptide hormones using an ensemble of machine learning and similarity‑based methods

Publication: Kaur, D., Arora, A., Vigneshwar, P., & Raghava, G.P.S. (2024). Prediction of peptide hormones using an ensemble of machine learning and similarity‑based methods. Proteomics, 24, e2400004. https://doi.org/10.1002/pmic.202400004

Overview: This dataset accompanies HOPPred, the first computational tool for predicting peptide hormones. Peptide hormones are genome‑encoded signal transduction molecules essential for regulating growth, development, and homeostasis; their dysregulation leads to endocrine disorders (e.g., diabetes, neoplasia). The dataset is curated from Hmrbase2 and other sources, balanced (1,174 hormonal and 1,174 non‑hormonal peptides), and redundancy‑reduced with CD‑HIT at 90% sequence identity.

Content:

Dataset                    Peptides
Hormonal (positive)        1,174
Non‑hormonal (negative)    1,174
Total                      2,348

Key Findings – Compositional analysis:

  • Hormonal peptides enriched in: Cysteine (C), Aspartic acid (D), Phenylalanine (F), Glycine (G), Arginine (R), Serine (S), Asparagine (N), Proline (P), Tyrosine (Y) – statistically significant (Mann‑Whitney U, p < 0.05)

  • Non‑hormonal peptides enriched in: Glutamic acid (E), Isoleucine (I), Leucine (L), Methionine (M), Glutamine (Q), Lysine (K), Threonine (T), Valine (V)
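
The comparison above can be sketched in a few lines: compute the percent composition of a residue in each peptide, then compare the two groups with the Mann‑Whitney U statistic. This is a minimal illustration, not the authors' pipeline. The example sequences and the function names are assumptions, and only the U statistic is computed here; a p‑value would normally come from a statistics library such as SciPy (`scipy.stats.mannwhitneyu`).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Percent composition of each of the 20 standard residues in one peptide."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS}

def mann_whitney_u(xs, ys):
    """U statistic for xs versus ys: pairs with x > y count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

# Illustrative sequences only; they are not claimed to be dataset entries.
hormonal = ["FVNQHLCGSHLVEALYLVCGERGFFYTPKT", "GIVEQCCTSICSLYQLENYCN", "CYIQNCPLG"]
non_hormonal = ["MKTLLVAAEELIKKQ", "MEEVKLTQIELMK", "AKVLEEIMKQT"]

cys_h = [aa_composition(s)["C"] for s in hormonal]
cys_n = [aa_composition(s)["C"] for s in non_hormonal]
u = mann_whitney_u(cys_h, cys_n)
print(f"U for cysteine = {u} out of {len(cys_h) * len(cys_n)} pairs")
```

A U value close to the total number of pairs means that nearly every hormonal peptide has a higher cysteine percentage than every non‑hormonal one in the sample.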

Exclusive motifs in hormonal peptides (MERCI): FGPR, WFGP, WFGPR, FGPRL, GPRL, MWFGPRL, LCGS (LCGS is a known motif in insulin chain B)
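
As a rough illustration of how the listed motifs could be used, the sketch below scans a peptide for exact occurrences of each motif. It only looks up motifs that are already known; it does not reproduce MERCI's motif discovery. The function name and the example sequence (human insulin B chain) are assumptions chosen for illustration.

```python
# Motifs as listed above.
HORMONE_MOTIFS = ["FGPR", "WFGP", "WFGPR", "FGPRL", "GPRL", "MWFGPRL", "LCGS"]

def find_motifs(seq, motifs=HORMONE_MOTIFS):
    """Return (motif, 0-based start) for every occurrence, including overlaps."""
    seq = seq.upper()
    hits = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            hits.append((motif, start))
            start = seq.find(motif, start + 1)
    # Order hits by position, then by motif, for a stable report.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))

# Human insulin B chain, used here only as an example input.
print(find_motifs("FVNQHLCGSHLVEALYLVCGERGFFYTPKT"))
```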

Best Model Performance (validation set – 20% held out):

Model                               AUC    MCC    Accuracy   Sensitivity   Specificity
Ensemble (LR + Motif + BLAST)       0.96   0.80   89.8%      90.1%         89.5%
LR (ML alone – top 50 features)     0.93   0.72   86.0%      85.3%         86.6%
TextCNN (DL)                        0.90   0.67   83.0%      87.0%         79.0%
RF (ML – top 50 features)           0.90   0.64   82.1%      80.2%         84.0%
TabNet (DL)                         0.75   0.57   74.0%      73.0%         75.0%
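
For readers who want to reproduce the threshold‑based columns, the sketch below derives sensitivity, specificity, accuracy, and MCC from a binary confusion matrix using their standard definitions. The counts in the example are hypothetical and are not taken from the HOPPred validation set. AUC is not covered, since it requires the ranked prediction scores and not only the counts.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when a marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}

# Hypothetical counts for a balanced set of 200 positives and 200 negatives.
metrics = binary_metrics(tp=180, tn=179, fp=21, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```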

Top features align with motifs: DPC1_CF (Cys‑Phe), TPC_FRP (Phe‑Arg‑Pro), TPC_GNF (Gly‑Asn‑Phe), TPC_LMG (Leu‑Met‑Gly), and TPC_RGL (Arg‑Gly‑Leu). These overlap with motifs such as FGPR and WFGPRL, supporting their biological relevance.
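
The feature names above point to dipeptide (DPC) and tripeptide (TPC) composition. The sketch below shows one common way to compute such features, assuming DPC1_CF refers to the adjacent Cys‑Phe pair and that values are expressed as a percentage of all overlapping windows. The exact normalisation used by HOPPred is not stated here, so treat the scaling, the function name, and the example sequence as assumptions.

```python
from collections import Counter

def kmer_composition(seq, k):
    """Percent frequency of each overlapping k-mer observed in seq."""
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return {}  # sequence shorter than k: no k-mers to count
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    return {kmer: 100.0 * c / n_windows for kmer, c in counts.items()}

seq = "MWFGPRLCF"  # made-up peptide for illustration
dpc = kmer_composition(seq, 2)  # dipeptide composition
tpc = kmer_composition(seq, 3)  # tripeptide composition
print("DPC CF :", dpc.get("CF", 0.0))
print("TPC FRP:", tpc.get("FRP", 0.0))
```

K‑mers that do not occur are simply absent from the dictionary, so `dict.get` with a default of 0.0 is used when building a fixed‑length feature vector.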

Data Curation & Quality Control:

  • Source: Hmrbase2 (hormone database) + PeptideAtlas + UniProt/Swiss‑Prot

  • Redundancy reduction: CD‑HIT at 90% sequence identity

  • Negative set: Randomly selected from Swiss‑Prot excluding known hormones

  • Train/validation split: 80/20 (5‑fold CV on training)

  • Feature selection: RFE (Recursive Feature Elimination) with Logistic Regression as estimator
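
The split described above can be sketched with the standard library alone: hold out 20% of each class for validation, then cut the remaining training indices into five folds. The seeds, function names, and rounding rule are assumptions, and the folds here are shuffled but not stratified. The source does not say which tooling was used.

```python
import random

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx), holding out val_fraction of each class."""
    rng = random.Random(seed)
    train, val = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_val = round(len(idx) * val_fraction)
        val.extend(idx[:n_val])
        train.extend(idx[n_val:])
    return sorted(train), sorted(val)

def k_folds(indices, k=5, seed=0):
    """Yield (fit_idx, test_idx) pairs that cover indices in k disjoint folds."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]

# Label vector matching the dataset's class sizes (1 = hormonal, 0 = non-hormonal).
labels = [1] * 1174 + [0] * 1174
train_idx, val_idx = stratified_split(labels)
print("train:", len(train_idx), "validation:", len(val_idx))
for fit_idx, test_idx in k_folds(train_idx):
    print("fold fit/test sizes:", len(fit_idx), len(test_idx))
```

Feature selection would then be run on the training portion. In scikit‑learn, for instance, `sklearn.feature_selection.RFE` wrapped around `LogisticRegression` with `n_features_to_select=50` would correspond to a "top 50 features" setting, though that library choice is an assumption.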

Usage: Predicting peptide hormones from sequence, designing novel hormone peptides (Design module), scanning protein sequences for hormone regions (Protein Scan module), identifying hormone‑associated motifs, developing peptide‑based therapeutics and endocrine disorder treatments.
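
To illustrate the idea behind scanning a protein for hormone regions, the sketch below slides a fixed‑length window along a sequence and keeps the windows that a scoring function rates above a threshold. It is not the Protein Scan module's code. The window length, the threshold, and especially `toy_score` are hypothetical stand‑ins for a trained HOPPred model.

```python
def sliding_windows(protein, window=20, step=1):
    """Yield (0-based start, peptide) for each full-length window."""
    protein = protein.upper()
    for start in range(0, len(protein) - window + 1, step):
        yield start, protein[start:start + window]

def scan_protein(protein, score_fn, window=20, threshold=0.5):
    """Return (start, peptide, score) for windows scoring at or above threshold."""
    hits = []
    for start, peptide in sliding_windows(protein, window):
        score = score_fn(peptide)
        if score >= threshold:
            hits.append((start, peptide, score))
    return hits

def toy_score(peptide):
    """Hypothetical placeholder scorer: 1.0 if the LCGS motif is present."""
    return 1.0 if "LCGS" in peptide else 0.0

protein = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"  # example input only
for start, peptide, score in scan_protein(protein, toy_score, window=10):
    print(start, peptide, score)
```

Swapping `toy_score` for a function that returns a real model probability leaves the rest of the scan unchanged.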

Related Resources: Web server: https://webs.iiitd.edu.in/raghava/hoppred/ | GitHub: https://github.com/raghavagps/HOPPRED

Contact: raghava@iiitd.ac.in (Gajendra P. S. Raghava)

Files

raghavagps/hoppred-v1.0.zip

Files (167.9 kB)

Name                          Size       Checksum
raghavagps/hoppred-v1.0.zip   167.9 kB   md5:04f3c0fd3f96250c820216afdbdcd176
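
The MD5 checksum listed above can be used to confirm that a downloaded archive is intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names are illustrative, and the local file path passed to `verify_download` is whatever name the archive was saved under.

```python
import hashlib

EXPECTED_MD5 = "04f3c0fd3f96250c820216afdbdcd176"  # from the file listing above

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("hoppred-v1.0.zip")` returns `True` only when the local copy matches the published checksum, assuming the archive was saved under that name.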

Additional details

Related works

Is supplement to
Software: https://github.com/raghavagps/hoppred/tree/v1.0 (URL)