Published April 29, 2026 | Version v1.0
Software Open

Identification of Multiple Prognostic Biomarkers sets for Risk stratification in SKCM

  • 1. Indraprastha Institute of Information Technology Delhi

Description

Title:
SKCM Prognostic Biomarker Sets – Multiple independent gene expression signatures for risk stratification in skin cutaneous melanoma

Project: SKCM Prognostic Biomarker – Identification of multiple prognostic biomarker sets for risk stratification in skin cutaneous melanoma

Publication: Malik, S., Tomer, R., Arora, A., & Raghava, G.P.S. (2026). Identification of multiple prognostic biomarkers sets for risk stratification in SKCM. Frontiers in Bioinformatics, 5, 1624329. https://doi.org/10.3389/fbinf.2025.1624329

Overview:
This repository accompanies the SKCM prognostic biomarker publication and provides seven independent, non‑overlapping sets of prognostic biomarkers for predicting overall survival (OS) in skin cutaneous melanoma (SKCM). Unlike existing studies that identify only a single biomarker signature, this work systematically generates multiple distinct gene sets (20 genes each) with no overlap, enabling flexible and robust risk stratification. SKCM is the most lethal form of skin cancer, with rising global incidence driven primarily by UV radiation exposure. Early detection and accurate risk assessment are critical for timely intervention and improved survival rates.

Dataset summary:

  • Source: TCGA (The Cancer Genome Atlas) – SKCM cohort

  • Samples: 287 patients with complete survival data

  • Original feature space: ~20,000 genes (transcriptomic profiling)

  • Survival classes (based on overall survival time):

    • Class 0: 0–1 year (29 patients – high risk)

    • Class 1: 1–3 years (110 patients – intermediate risk)

    • Class 2: 3–5 years (50 patients – intermediate risk)

    • Class 3: >5 years (98 patients – low risk/long‑term survivors)

  • Class balancing: SMOTE applied to address severe imbalance (final: 110 patients per class; 440 total for ML training)
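The class definitions and balancing arithmetic above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function name `survival_class` is hypothetical, and treating each upper bound as inclusive is an assumption, since the summary does not say which class a boundary value falls into.

```python
# Class boundaries in years, taken from the survival-class list above.
# Assumption: each upper bound is inclusive (exactly 1.0 year -> Class 0).
UPPER_BOUNDS_YEARS = [1.0, 3.0, 5.0]


def survival_class(os_years):
    """Map overall survival time (years) to a class label 0-3."""
    if os_years < 0:
        raise ValueError("survival time must be non-negative")
    for label, upper in enumerate(UPPER_BOUNDS_YEARS):
        if os_years <= upper:
            return label
    return 3  # more than 5 years


# Class sizes reported in the dataset summary.
class_counts = {0: 29, 1: 110, 2: 50, 3: 98}

# SMOTE-style balancing raises every class to the size of the largest class.
target = max(class_counts.values())
synthetic_needed = {c: target - n for c, n in class_counts.items()}

print(sum(class_counts.values()))   # 287 patients in the original cohort
print(synthetic_needed)             # {0: 81, 1: 0, 2: 60, 3: 12}
print(target * len(class_counts))   # 440 samples after balancing
```

The counts reproduce the figures quoted above: 287 original patients and 4 × 110 = 440 samples after oversampling.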

Key Gene Biomarkers Identified (Primary Set – 20 genes):

TEKT5, ZNF154, H2AC14, BX284668.6, MYCNOS, STUM, SERTM2, RPSAP18, REG4, PSCA, PAEP, ACTR3C, MSLN, MRPS18AP1, ISLR, IL37, IGLV3.16, H2BC11, GPR25, MTND4P35

Notable biomarkers with literature support in SKCM:

  • PSCA (Prostate Stem Cell Antigen): Upregulated in SKCM; positively correlated with RNA modification genes

  • PAEP (Progestogen‑Associated Endometrial Protein): Upregulated; correlates with poor survival outcomes

  • ACTR3C: Mutations (p.Gly58Arg, p.Gly58Glu) may alter cell signaling/motility in melanoma progression

  • MSLN: Downregulated in SKCM

  • ISLR: Downregulated; negatively correlated with tumor mutational burden (TMB)

  • STUM: Biomarker for response to Nivolumab (immune checkpoint inhibitor)

  • SERTM2: Gene alterations frequently observed in melanoma (tumor progression, immune evasion)

LASSO Cox regression – 17 prognosis‑related genes:

| Gene | Coefficient | Risk/Beneficial | Known role in SKCM |
| --- | --- | --- | --- |
| ATP11A | 0.035 | Risk | |
| B2M | -0.037 | Beneficial | Associated with worse OS (hazardous factor; immune evasion) |
| BISPR | -0.028 | Beneficial | |
| CIB2 | 0.005 | Risk | |
| CYTL1 | 0.082 | Risk | Progressively upregulated in normal skin → nevi → melanoma |
| GBP1P1 | -0.020 | Beneficial | |
| GBP2 | -0.015 | Beneficial | Downregulated; low expression correlates with poor prognosis |
| GCA | -0.016 | Beneficial | |
| HEXD | -0.042 | Beneficial | |
| HLA‑DQB1 | -0.033 | Beneficial | Upregulated in cutaneous melanoma |
| KLRC1 | -0.028 | Beneficial | Associated with immune cell infiltration (NKG2A) |
| LRRK2.DT | -0.036 | Beneficial | |
| MCOLN2 | -0.018 | Beneficial | Upregulated in other cancers (prostate); potential role in progression |
| SLC2A5 | -0.001 | Beneficial | Protective factor in SKCM, SARC, THCA |
| TTYH2 | 0.049 | Risk | Negatively correlated with OS (r = -0.23) |
| WIPF1 | -0.020 | Beneficial | Upregulated in early melanoma |
| XCL2 | -0.005 | Beneficial | High expression in SKCM |
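Coefficients from a LASSO Cox model are commonly combined into a linear risk score (the sum of coefficient × expression). The sketch below assumes that convention; the record does not state how, or whether, the authors computed such a score. The patient profile is hypothetical, and treating missing genes as zero is an assumption that only makes sense for z‑scored expression.

```python
# Coefficients copied from the LASSO Cox table above.
LASSO_COEFFICIENTS = {
    "ATP11A": 0.035, "B2M": -0.037, "BISPR": -0.028, "CIB2": 0.005,
    "CYTL1": 0.082, "GBP1P1": -0.020, "GBP2": -0.015, "GCA": -0.016,
    "HEXD": -0.042, "HLA-DQB1": -0.033, "KLRC1": -0.028, "LRRK2.DT": -0.036,
    "MCOLN2": -0.018, "SLC2A5": -0.001, "TTYH2": 0.049, "WIPF1": -0.020,
    "XCL2": -0.005,
}


def direction(coefficient):
    """Positive coefficient -> 'Risk', otherwise 'Beneficial' (as in the table)."""
    return "Risk" if coefficient > 0 else "Beneficial"


def risk_score(expression):
    """Linear predictor: sum of coefficient * expression over the 17 genes.

    Assumption: genes absent from `expression` contribute 0, i.e. they sit
    at the cohort mean of z-scored data.
    """
    return sum(
        coef * expression.get(gene, 0.0)
        for gene, coef in LASSO_COEFFICIENTS.items()
    )


# Hypothetical z-scored profile, for illustration only.
patient = {"CYTL1": 1.0, "HEXD": 1.0}
print(round(risk_score(patient), 3))

risk_genes = [g for g, c in LASSO_COEFFICIENTS.items() if direction(c) == "Risk"]
print(risk_genes)
```

A higher score indicates higher predicted hazard under this convention; any cut‑off for splitting patients into risk groups would have to come from the publication.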

Drug‑gene interactions (DGIdb): 17 drugs were identified that target 5 of the selected genes (mostly inhibitors); details are in Supplementary Table S9.

Correlation analysis (Pearson) – top correlated genes with OS:

| Positive correlation (higher expression → longer survival) | Negative correlation (higher expression → shorter survival) |
| --- | --- |
| CREG1 (r = 0.40, p = 3.63E-12) | AC008687.4 (r = -0.23, p = 8.28E-05) |
| PCGF5 (r = 0.38, p = 4.47E-11) | TTYH2 (r = -0.23, p = 0.00010) |
| VPS13C (r = 0.36, p = 2.06E-10) | G6PC3 (r = -0.21, p = 0.00025) |
| CPD (r = 0.35, p = 1.17E-09) | BOK (r = -0.21, p = 0.00033) |
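For readers who want to recompute such coefficients, Pearson's r between one gene's expression and overall survival time can be sketched as follows. The function name and the toy vectors are illustrative assumptions, not data from the repository.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (not TCGA data).
expression = [0.2, 0.5, 0.9, 1.4, 1.1]
os_months = [8.0, 14.0, 30.0, 61.0, 40.0]
r = pearson_r(expression, os_months)
print(f"r = {r:.2f}")
```

A positive r corresponds to the left column of the table (higher expression, longer survival). The p‑values reported there additionally require a significance test on r, which a statistics library such as SciPy (`scipy.stats.pearsonr`) provides.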

Univariate Cox regression: 4,324 genes significantly associated with OS (p < 0.01).

  • 1,264 genes – HR > 1 (risk factors)

  • 3,060 genes – HR < 1 (protective factors)

GO enrichment (top 50 positively correlated genes):

  • Biological processes: lipid translocation (p = 0.00002), phospholipid translocation (p = 0.00006)

  • Cellular components: mitochondrial outer membrane (p = 0.0002), organelle outer membrane (p = 0.0004)

  • Molecular functions: cysteine‑type endopeptidase inhibitor activity (p = 0.001), protein serine/threonine kinase activity (p = 0.001)

Reactome pathways: Ion Transport By P‑type ATPases (p = 0.0003), Ion Channel Transport (p = 0.0009), plus RNA Polymerase I Promoter Opening, Packaging of Telomere Ends, Post‑translational Protein Modification, Metabolism of Proteins.

 

Model Performance (CatBoost – best performing classifier):

Primary biomarker set (20 genes):

| Dataset | Accuracy | AUC | Sensitivity | Specificity | Kappa | MCC |
| --- | --- | --- | --- | --- | --- | --- |
| Training | 0.65 | 0.89 | 0.65 | 0.65 | 0.54 | 0.53 |
| Test | 0.68 | 0.90 | 0.68 | 0.68 | 0.58 | 0.58 |
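Kappa and MCC in the table above are agreement statistics for multi‑class predictions. The sketch below computes both from a confusion matrix using the standard definitions (Cohen's kappa and the multi‑class generalisation of the Matthews correlation coefficient). The confusion matrix is a made‑up example, not the model's actual output.

```python
import math


def kappa_and_mcc(confusion):
    """Return (Cohen's kappa, multi-class MCC) for a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    k = len(confusion)
    if any(len(row) != k for row in confusion):
        raise ValueError("confusion matrix must be square")
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")

    correct = sum(confusion[i][i] for i in range(k))
    true_counts = [sum(row) for row in confusion]
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    cross = sum(t * p for t, p in zip(true_counts, pred_counts))

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = correct / total
    expected = cross / total**2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

    # Multi-class MCC.
    denominator = math.sqrt(
        (total**2 - sum(p * p for p in pred_counts))
        * (total**2 - sum(t * t for t in true_counts))
    )
    mcc = (correct * total - cross) / denominator if denominator else 0.0
    return kappa, mcc


# Hypothetical 4-class confusion matrix, for illustration only.
toy = [
    [5, 0, 0, 0],
    [1, 3, 1, 0],
    [0, 1, 4, 0],
    [0, 1, 0, 4],
]
kappa, mcc = kappa_and_mcc(toy)
print(f"kappa = {kappa:.2f}, MCC = {mcc:.2f}")
```

Both statistics equal 1 for perfect predictions and 0 for chance‑level agreement, which makes them more informative than accuracy alone for a four‑class problem.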

Class‑wise AUC (Primary set – Test):

| Class | Survival range | AUC |
| --- | --- | --- |
| Class 0 | 0–1 year (high risk) | 0.99 |
| Class 1 | 1–3 years | 0.83 |
| Class 2 | 3–5 years | 0.93 |
| Class 3 | >5 years (long‑term) | 0.84 |
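A class‑wise AUC for a four‑class model is usually obtained one‑vs‑rest: each class in turn is treated as the positive class and scored by its predicted probability. The record does not state which scheme was used, so the sketch below is an assumption. It uses the rank interpretation of AUC (the probability that a random positive sample scores higher than a random negative one), and the probabilities are invented for illustration.

```python
def binary_auc(is_positive, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, probabilities, n_classes=4):
    """Per-class AUC: class k against all other classes."""
    return {
        k: binary_auc([y == k for y in y_true], [row[k] for row in probabilities])
        for k in range(n_classes)
    }


# Hypothetical true labels and predicted class probabilities.
y_true = [0, 1, 2, 3, 1, 3]
probabilities = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.40, 0.30, 0.10],
    [0.10, 0.20, 0.60, 0.10],
    [0.05, 0.30, 0.15, 0.50],
    [0.10, 0.25, 0.25, 0.40],
    [0.10, 0.20, 0.10, 0.60],
]
for k, auc in one_vs_rest_auc(y_true, probabilities).items():
    print(f"Class {k}: AUC = {auc:.2f}")
```

The pairwise loop is quadratic in the number of samples, which is fine for a cohort of a few hundred patients; library implementations use sorting instead.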

Performance across all seven biomarker sets (Test):

| Biomarker Set | Accuracy | AUC | Kappa | MCC |
| --- | --- | --- | --- | --- |
| Primary | 0.68 | 0.90 | 0.58 | 0.58 |
| Secondary | 0.68 | 0.89 | 0.58 | 0.58 |
| Third | 0.66 | 0.87 | 0.56 | 0.55 |
| Fourth | 0.60 | 0.85 | 0.48 | 0.47 |
| Fifth | 0.73 | 0.91 | 0.64 | 0.64 |
| Sixth | 0.56 | 0.84 | 0.41 | 0.41 |
| Seventh | 0.47 | 0.89 | 0.30 | 0.29 |

Best overall: Fifth biomarker set – AUC 0.91, Kappa 0.64, MCC 0.64

Class‑wise AUC across all seven sets (Test):

| Biomarker Set | Class 0 (0–1y) | Class 1 (1–3y) | Class 2 (3–5y) | Class 3 (>5y) |
| --- | --- | --- | --- | --- |
| Primary | 0.99 | 0.83 | 0.93 | 0.84 |
| Secondary | 0.99 | 0.77 | 0.90 | 0.90 |
| Third | 0.98 | 0.80 | 0.93 | 0.73 |
| Fourth | 0.90 | 0.80 | 0.90 | 0.82 |
| Fifth | 0.99 | 0.85 | 0.91 | 0.90 |
| Sixth | 0.92 | 0.75 | 0.90 | 0.77 |
| Seventh | 0.95 | 0.82 | 0.93 | 0.85 |

Key observations:

  • Class 0 (high‑risk patients): Exceptional prediction across all sets (AUC 0.90–0.99) – clinically valuable for identifying patients needing immediate intervention

  • Class 2 (3–5 years): Consistently high AUC (0.90–0.93) – robust intermediate survival prediction

  • Class 1 (1–3 years): Moderate performance (AUC 0.75–0.85) – challenging mid‑range group

  • Class 3 (>5 years): Variable performance (AUC 0.73–0.90) – requires further refinement
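The per‑class AUC ranges quoted in these observations can be checked directly against the class‑wise table. The snippet below only copies the tabulated values and takes the minimum and maximum per class; the names used are illustrative.

```python
# Test AUC per class (Class 0..3), copied from the class-wise table above.
CLASSWISE_AUC = {
    "Primary":   [0.99, 0.83, 0.93, 0.84],
    "Secondary": [0.99, 0.77, 0.90, 0.90],
    "Third":     [0.98, 0.80, 0.93, 0.73],
    "Fourth":    [0.90, 0.80, 0.90, 0.82],
    "Fifth":     [0.99, 0.85, 0.91, 0.90],
    "Sixth":     [0.92, 0.75, 0.90, 0.77],
    "Seventh":   [0.95, 0.82, 0.93, 0.85],
}


def auc_range(class_index):
    """Return (min, max) test AUC for one class across the seven sets."""
    values = [row[class_index] for row in CLASSWISE_AUC.values()]
    return min(values), max(values)


for k in range(4):
    low, high = auc_range(k)
    print(f"Class {k}: {low:.2f} to {high:.2f}")
```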

Ensemble Models (Primary set – 20 genes):

| Ensemble Method | Base models | Meta‑model | Test AUC | Test MCC |
| --- | --- | --- | --- | --- |
| Voting | RF + ET + LightGBM | Hard voting | 0.87 | 0.56 |
| Stacking | RF + ET | Logistic Regression | 0.88 | 0.61 |
| Stacking | RF + ET + XGB | CatBoost | 0.88 | 0.52 |
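Hard voting, as used in the first ensemble, takes the majority class label across the base models for each sample. The sketch below illustrates only that aggregation step on invented predictions. Breaking ties in favour of the smallest class label is an assumption made here, not something the record specifies.

```python
from collections import Counter


def hard_vote(model_predictions):
    """Majority vote over per-model label lists of equal length.

    Assumption: ties are broken in favour of the smallest class label.
    """
    n_samples = len(model_predictions[0])
    if any(len(p) != n_samples for p in model_predictions):
        raise ValueError("all models must predict the same number of samples")
    voted = []
    for sample_votes in zip(*model_predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Hypothetical class predictions from three base models (RF, ET, LightGBM).
rf_pred = [0, 1, 2, 3]
et_pred = [0, 1, 3, 3]
lgbm_pred = [1, 1, 2, 0]
print(hard_vote([rf_pred, et_pred, lgbm_pred]))  # [0, 1, 2, 3]
```

Stacking differs in that the base models' outputs become input features for a second‑level model (Logistic Regression or CatBoost in the table) rather than being counted as votes.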

External validation (GEO dataset GSE65904):

| Biomarker Set | Matched genes | Training AUC | Test AUC |
| --- | --- | --- | --- |
| Primary set | 15 genes | 0.85 | 0.83 |
| Third set | 12 genes | 0.85 | 0.86 |
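"Matched genes" refers to the signature genes that are also measured in the external cohort. A minimal sketch of that matching step is shown below. The external gene list is a hypothetical placeholder, since the record does not list which 15 primary‑set genes are present in GSE65904.

```python
# Primary 20-gene set, copied from the list earlier in this description.
PRIMARY_SET = [
    "TEKT5", "ZNF154", "H2AC14", "BX284668.6", "MYCNOS", "STUM", "SERTM2",
    "RPSAP18", "REG4", "PSCA", "PAEP", "ACTR3C", "MSLN", "MRPS18AP1",
    "ISLR", "IL37", "IGLV3.16", "H2BC11", "GPR25", "MTND4P35",
]


def matched_genes(signature, available):
    """Signature genes present in an external dataset, in signature order."""
    available_set = set(available)
    return [gene for gene in signature if gene in available_set]


# Hypothetical gene symbols from an external expression matrix.
external_genes = ["PSCA", "TP53", "MSLN", "PAEP", "BRAF"]
print(matched_genes(PRIMARY_SET, external_genes))  # ['PSCA', 'PAEP', 'MSLN']
```

Exact string matching is used here; in practice, gene symbols may differ between platforms and annotation versions, so some harmonisation could be needed before matching.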

Class‑wise AUC on GSE65904 (Third set):

| Class | AUC |
| --- | --- |
| Class 0 (0–1y) | 0.79 |
| Class 1 (1–3y) | 0.78 |
| Class 2 (3–5y) | 0.90 |
| Class 3 (>5y) | 0.97 |

Comparison with existing methods:

| Method | Focus | AUC |
| --- | --- | --- |
| Our study – Fifth set | Multiple independent gene sets | 0.91 |
| Our study – Primary set | Multiple independent gene sets | 0.90 |
| Yang et al. (2023) | Invasion‑associated genes (IAGS) | 0.88 |
| Geng et al. (2023) | NLR‑related genes | ~0.85 |
| Ding et al. (2023) | Chemokine‑related (14 genes) | ~0.84 |
| Ping et al. (2022) | Ferroptosis‑related (10 genes) | ~0.82 |
| Xiao et al. (2021) | Immune‑related lncRNAs (8 genes) | ~0.80 |

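Our approach (multiple independent biomarker sets) compares favorably with the existing single‑signature methods listed above: 4 of the 7 sets (Primary, Secondary, Fifth, Seventh) achieve a test AUC above 0.88, the highest AUC among those methods.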

Feature Selection Methods Evaluated:

| Method | Description |
| --- | --- |
| SVC‑L1 | Linear SVM with L1 penalty – primary method for biomarker identification |
| Recursive Feature Elimination (RFE) | Recursively ranks features, selects optimal subset |
| Sequential Feature Selection (SFS) | Sequentially adds/removes features to optimize performance |
| SelectKBest | Selects K most relevant features using f_classif |
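Of the methods listed, SelectKBest with `f_classif` is the simplest to illustrate: each gene is scored by a one‑way ANOVA F‑statistic across the survival classes, and the K highest‑scoring genes are kept. The pure‑Python sketch below reimplements that scoring idea on toy data. It is not the repository's code, and the gene names and values are hypothetical.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of one feature across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 classes and more samples than classes")
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    ss_between = sum(
        len(g) * (means[label] - grand_mean) ** 2 for label, g in groups.items()
    )
    ss_within = sum(
        (v - means[label]) ** 2 for label, g in groups.items() for v in g
    )
    if ss_within == 0:
        return float("inf")  # classes perfectly separated by this feature
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(features, labels, k):
    """Names of the k features with the highest F-statistic."""
    scores = {name: anova_f(vals, labels) for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: GENE_A separates the two classes, GENE_B does not.
labels = [0, 0, 1, 1]
features = {
    "GENE_A": [1.0, 2.0, 9.0, 10.0],
    "GENE_B": [1.0, 9.0, 2.0, 10.0],
}
print(select_k_best(features, labels, k=1))  # ['GENE_A']
```

With scikit‑learn, the same selection is typically written as `SelectKBest(score_func=f_classif, k=20)`; SVC‑L1, RFE and SFS are model‑based and do not reduce to a per‑gene score in this way.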

Machine Learning Algorithms Evaluated:

CatBoost (best), Extra Trees (ET), Random Forest (RF), XGBoost (XGB), LightGBM, SVM, KNN, MLP, AdaBoost, Gradient Boosting, Logistic Regression

Data Curation & Quality Control:

  • Source: TCGA SKCM cohort via TCGAbiolinks (R) / Xena Browser

  • Samples: 287 patients with complete survival data

  • Normalization: Z‑score scaling (mean = 0, SD = 1)

  • Class definition: Based on overall survival time (months)

  • Class balancing: SMOTE (Synthetic Minority Over‑sampling Technique)

  • Validation split: Stratified 80/20 train/test split (5‑fold CV), with each class represented equally

  • External validation: GEO GSE65904 (independent melanoma cohort)

  • Multiple testing correction: Benjamini‑Hochberg FDR (p < 0.05)
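Two of the preprocessing steps above are easy to express compactly: Z‑score scaling of a gene's expression values and Benjamini‑Hochberg adjustment of a list of p‑values. The sketch below is a hedged illustration with made‑up inputs. Using the population standard deviation for scaling is an assumption, as the record only states mean = 0 and SD = 1.

```python
import statistics


def z_score(values):
    """Scale values to mean 0 and SD 1 (population SD; an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot scale a constant feature")
    return [(v - mean) / sd for v in values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up inputs for illustration.
print([round(z, 3) for z in z_score([1.0, 2.0, 3.0])])
print([round(q, 3) for q in benjamini_hochberg([0.01, 0.04, 0.03, 0.005])])
```

Genes whose adjusted p‑value falls below 0.05 would then be retained, matching the FDR threshold stated in the list above.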

Usage:
These datasets and models are designed for:

  • Identifying multiple independent prognostic biomarker sets for SKCM

  • Risk stratification of SKCM patients (high‑risk vs. long‑term survivors)

  • Training and benchmarking machine learning classifiers (CatBoost, RF, ET, XGB) for survival prediction

  • Drug target discovery (DGIdb analysis of prognostic genes)

  • External validation of biomarker signatures on independent melanoma cohorts

  • Understanding molecular determinants (lipid translocation, ion transport, mitochondrial function) of melanoma prognosis

Key Biological Insights:

  • CREG1, PCGF5, VPS13C – high expression associated with longer survival (positive correlation)

  • TTYH2, AC008687.4 – high expression associated with shorter survival (negative correlation)

  • Lipid/phospholipid translocation and mitochondrial outer membrane pathways enriched in positively correlated genes

  • Ion transport pathways (P‑type ATPases, ion channels) significantly enriched

License: CC BY 4.0 (as per Frontiers open access license)

Contact:
Prof. Gajendra P. S. Raghava

Files

raghavagps/skcm_prognostic_biomarker-v1.0.zip

Files (279.9 MB)

  • 279.8 MB – md5:f39fa18fb5b71e46e61914d858f75a7c

  • 34.0 kB – md5:aad2a66c90d5ff979ed7a1bbec3e48a2
