Identification of Multiple Prognostic Biomarkers sets for Risk stratification in SKCM
Description
Title:
SKCM Prognostic Biomarker Sets – Multiple independent gene expression signatures for risk stratification in skin cutaneous melanoma
Description:
Project: SKCM Prognostic Biomarker – Identification of multiple prognostic biomarker sets for risk stratification in skin cutaneous melanoma
Publication: Malik, S., Tomer, R., Arora, A., & Raghava, G.P.S. (2026). Identification of multiple prognostic biomarkers sets for risk stratification in SKCM. Frontiers in Bioinformatics, 5, 1624329. https://doi.org/10.3389/fbinf.2025.1624329
Overview:
This repository accompanies the SKCM prognostic biomarker publication and provides seven independent, non‑overlapping sets of prognostic biomarkers for predicting overall survival (OS) in skin cutaneous melanoma (SKCM). Unlike existing studies that identify only a single biomarker signature, this work systematically generates multiple distinct gene sets (20 genes each) with no overlap, enabling flexible and robust risk stratification. SKCM is the most lethal form of skin cancer, with rising global incidence driven primarily by UV radiation exposure. Early detection and accurate risk assessment are critical for timely intervention and improved survival rates.
Dataset summary:
- Source: TCGA (The Cancer Genome Atlas) – SKCM cohort
- Samples: 287 patients with complete survival data
- Original feature space: ~20,000 genes (transcriptomic profiling)
- Survival classes (based on overall survival time):
  - Class 0: 0–1 year (29 patients – high risk)
  - Class 1: 1–3 years (110 patients – intermediate risk)
  - Class 2: 3–5 years (50 patients – intermediate risk)
  - Class 3: >5 years (98 patients – low risk/long‑term survivors)
- Class balancing: SMOTE applied to address severe imbalance (final: 110 patients per class; 440 total for ML training)
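The class definition and the balancing arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `survival_class` is hypothetical, and treating each upper bound as inclusive is an assumption, since the summary only gives the ranges.

```python
def survival_class(os_years: float) -> int:
    """Map overall survival time (years) to the four classes listed above.

    Assumption: upper bounds are inclusive (e.g. exactly 1 year -> Class 0);
    the dataset summary does not state how boundary values are handled.
    """
    if os_years < 0:
        raise ValueError("survival time must be non-negative")
    if os_years <= 1:
        return 0
    if os_years <= 3:
        return 1
    if os_years <= 5:
        return 2
    return 3


# Class sizes reported in the dataset summary.
class_counts = {0: 29, 1: 110, 2: 50, 3: 98}

# Balancing to the majority class size, as the reported 110 per class implies.
target = max(class_counts.values())
synthetic_needed = {c: target - n for c, n in class_counts.items()}

print(sum(class_counts.values()))   # 287 original patients
print(synthetic_needed)             # {0: 81, 1: 0, 2: 60, 3: 12}
print(target * len(class_counts))   # 440 samples after balancing
```

The counts show that most synthetic samples (81 of 153) are generated for Class 0, the smallest group.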
Key Gene Biomarkers Identified (Primary Set – 20 genes):
TEKT5, ZNF154, H2AC14, BX284668.6, MYCNOS, STUM, SERTM2, RPSAP18, REG4, PSCA, PAEP, ACTR3C, MSLN, MRPS18AP1, ISLR, IL37, IGLV3.16, H2BC11, GPR25, MTND4P35
Notable biomarkers with literature support in SKCM:
- PSCA (Prostate Stem Cell Antigen): Upregulated in SKCM; positively correlated with RNA modification genes
- PAEP (Progestogen‑Associated Endometrial Protein): Upregulated; correlates with poor survival outcomes
- ACTR3C: Mutations (p.Gly58Arg, p.Gly58Glu) may alter cell signaling/motility in melanoma progression
- MSLN: Downregulated in SKCM
- ISLR: Downregulated; negatively correlated with tumor mutational burden (TMB)
- STUM: Biomarker for response to Nivolumab (immune checkpoint inhibitor)
- SERTM2: Gene alterations frequently observed in melanoma (tumor progression, immune evasion)
LASSO Cox regression – 17 prognosis‑related genes:
| Gene | Coefficient | Risk/Beneficial | Known role in SKCM |
|---|---|---|---|
| ATP11A | 0.035 | Risk | — |
| B2M | -0.037 | Beneficial | Associated with worse OS (hazardous factor; immune evasion) |
| BISPR | -0.028 | Beneficial | — |
| CIB2 | 0.005 | Risk | — |
| CYTL1 | 0.082 | Risk | Progressively upregulated in normal skin → nevi → melanoma |
| GBP1P1 | -0.020 | Beneficial | — |
| GBP2 | -0.015 | Beneficial | Downregulated; low expression correlates with poor prognosis |
| GCA | -0.016 | Beneficial | — |
| HEXD | -0.042 | Beneficial | — |
| HLA‑DQB1 | -0.033 | Beneficial | Upregulated in cutaneous melanoma |
| KLRC1 | -0.028 | Beneficial | Associated with immune cell infiltration (NKG2A) |
| LRRK2.DT | -0.036 | Beneficial | — |
| MCOLN2 | -0.018 | Beneficial | Upregulated in other cancers (prostate); potential role in progression |
| SLC2A5 | -0.001 | Beneficial | Protective factor in SKCM, SARC, THCA |
| TTYH2 | 0.049 | Risk | Negatively correlated with OS (r = -0.23) |
| WIPF1 | -0.020 | Beneficial | Upregulated in early melanoma |
| XCL2 | -0.005 | Beneficial | High expression in SKCM |
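A LASSO Cox model is commonly used by combining coefficients and expression values into a linear risk score. The sketch below shows that conventional calculation with the coefficients from the table; the source does not state that a risk score was computed this way, so the formula, the function names, and the handling of missing genes are assumptions.

```python
# Coefficients copied from the LASSO Cox table above.
LASSO_COEFS = {
    "ATP11A": 0.035, "B2M": -0.037, "BISPR": -0.028, "CIB2": 0.005,
    "CYTL1": 0.082, "GBP1P1": -0.020, "GBP2": -0.015, "GCA": -0.016,
    "HEXD": -0.042, "HLA-DQB1": -0.033, "KLRC1": -0.028, "LRRK2.DT": -0.036,
    "MCOLN2": -0.018, "SLC2A5": -0.001, "TTYH2": 0.049, "WIPF1": -0.020,
    "XCL2": -0.005,
}


def gene_label(coef: float) -> str:
    """Positive coefficient -> 'Risk', negative -> 'Beneficial' (as in the table)."""
    return "Risk" if coef > 0 else "Beneficial"


def risk_score(expression: dict) -> float:
    """Linear predictor: sum of coefficient * expression over the 17 genes.

    Assumption: genes missing from `expression` contribute 0, which
    corresponds to the cohort mean when expression is z-scored.
    """
    return sum(coef * expression.get(gene, 0.0)
               for gene, coef in LASSO_COEFS.items())


# Hypothetical z-scored expression profile for one patient.
patient = {"CYTL1": 1.5, "TTYH2": 1.0, "HEXD": -0.5}
print(round(risk_score(patient), 3))  # 0.193

risk_genes = [g for g, c in LASSO_COEFS.items() if gene_label(c) == "Risk"]
print(risk_genes)  # ['ATP11A', 'CIB2', 'CYTL1', 'TTYH2']
```

A higher score corresponds to a higher predicted hazard; only four of the 17 genes carry positive (risk) coefficients.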
Drug‑gene interactions (DGIdb): 17 drugs identified targeting 5 of the selected genes (mostly inhibitors). Detailed in Supplementary Table S9.
Correlation analysis (Pearson) – top correlated genes with OS:
| Positive correlation (higher expression → longer survival) | Negative correlation (higher expression → shorter survival) |
|---|---|
| CREG1 (r = 0.40, p = 3.63E-12) | AC008687.4 (r = -0.23, p = 8.28E-05) |
| PCGF5 (r = 0.38, p = 4.47E-11) | TTYH2 (r = -0.23, p = 0.00010) |
| VPS13C (r = 0.36, p = 2.06E-10) | G6PC3 (r = -0.21, p = 0.00025) |
| CPD (r = 0.35, p = 1.17E-09) | BOK (r = -0.21, p = 0.00033) |
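For readers who want to reproduce this kind of table, the Pearson coefficient and its t statistic can be computed as follows. This is a generic sketch with toy data; the function names are illustrative, and using n = 287 for the t statistic assumes that every patient contributed to the correlation, which the source does not state explicitly.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length >= 3")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r != 0 with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Toy data: expression rises with survival time, so r is positive.
expression = [0.1, 0.4, 0.35, 0.8, 0.9]
survival_months = [6, 20, 18, 50, 70]
print(pearson_r(expression, survival_months) > 0)  # True

# CREG1 from the table: r = 0.40; n = 287 is an assumption (see above).
print(round(t_statistic(0.40, 287), 2))  # 7.37
```

The p-value then follows from the t distribution with n - 2 degrees of freedom, for example via a statistics package.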
Univariate Cox regression: 4,324 genes significantly associated with OS (p < 0.01).
- 1,264 genes – HR > 1 (risk factors)
- 3,060 genes – HR < 1 (protective factors)
GO enrichment (top 50 positively correlated genes):
- Biological processes: lipid translocation (p = 0.00002), phospholipid translocation (p = 0.00006)
- Cellular components: mitochondrial outer membrane (p = 0.0002), organelle outer membrane (p = 0.0004)
- Molecular functions: cysteine‑type endopeptidase inhibitor activity (p = 0.001), protein serine/threonine kinase activity (p = 0.001)
Reactome pathways: Ion Transport By P‑type ATPases (p = 0.0003), Ion Channel Transport (p = 0.0009), plus RNA Polymerase I Promoter Opening, Packaging of Telomere Ends, Post‑translational Protein Modification, Metabolism of Proteins.
Model Performance (CatBoost – best performing classifier):
Primary biomarker set (20 genes):
| Dataset | Accuracy | AUC | Sensitivity | Specificity | Kappa | MCC |
|---|---|---|---|---|---|---|
| Training | 0.65 | 0.89 | 0.65 | 0.65 | 0.54 | 0.53 |
| Test | 0.68 | 0.90 | 0.68 | 0.68 | 0.58 | 0.58 |
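Kappa and MCC in the table are agreement measures that correct for chance, which is why they sit below accuracy. The sketch below computes Cohen's kappa and the multiclass MCC from a confusion matrix using the standard definitions; the matrices are hypothetical and are not taken from the study.

```python
import math


def kappa_and_mcc(cm):
    """Cohen's kappa and multiclass MCC from a square confusion matrix.

    cm[i][j] = number of samples with true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    true_counts = [sum(cm[i]) for i in range(k)]                       # row sums
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column sums
    cross = sum(t * p for t, p in zip(true_counts, pred_counts))

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = correct / total
    p_expected = cross / total ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected < 1 else 0.0

    # Multiclass Matthews correlation coefficient.
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    mcc = (correct * total - cross) / denominator if denominator else 0.0
    return kappa, mcc


# Hypothetical two-class example (not from the study).
kappa, mcc = kappa_and_mcc([[5, 1], [2, 4]])
print(round(kappa, 3), round(mcc, 3))  # 0.5 0.507
```

With four balanced classes, chance agreement is 0.25, so an accuracy of 0.68 maps to a kappa of roughly 0.57, in line with the reported 0.58.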
Class‑wise AUC (Primary set – Test):
| Class | Survival range | AUC |
|---|---|---|
| Class 0 | 0–1 year (high risk) | 0.99 |
| Class 1 | 1–3 years | 0.83 |
| Class 2 | 3–5 years | 0.93 |
| Class 3 | >5 years (long‑term) | 0.84 |
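Class-wise AUC in a four-class problem is typically computed one-vs-rest: each class in turn is treated as positive and the other three as negative. The sketch below does this with the rank-based (Mann-Whitney) formulation; whether the authors used exactly this scheme is an assumption, and the labels and scores are toy values.

```python
def auc_one_vs_rest(y_true, scores, positive_class):
    """AUC for one class against all others.

    Equals P(score of a positive > score of a negative) + 0.5 * P(tie),
    i.e. the Mann-Whitney U statistic divided by the number of pairs.
    `scores` holds the predicted probability of `positive_class`.
    """
    pos = [s for y, s in zip(y_true, scores) if y == positive_class]
    neg = [s for y, s in zip(y_true, scores) if y != positive_class]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: true classes and predicted probability of Class 0.
y_true = [0, 0, 1, 1, 2, 3]
p_class0 = [0.9, 0.8, 0.1, 0.3, 0.2, 0.85]
print(auc_one_vs_rest(y_true, p_class0, positive_class=0))  # 0.875
```

Here seven of the eight positive/negative pairs are ranked correctly, giving 7/8 = 0.875.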
Performance across all seven biomarker sets (Test):
| Biomarker Set | Accuracy | AUC | Kappa | MCC |
|---|---|---|---|---|
| Primary | 0.68 | 0.90 | 0.58 | 0.58 |
| Secondary | 0.68 | 0.89 | 0.58 | 0.58 |
| Third | 0.66 | 0.87 | 0.56 | 0.55 |
| Fourth | 0.60 | 0.85 | 0.48 | 0.47 |
| Fifth | 0.73 | 0.91 | 0.64 | 0.64 |
| Sixth | 0.56 | 0.84 | 0.41 | 0.41 |
| Seventh | 0.47 | 0.89 | 0.30 | 0.29 |
Best overall: Fifth biomarker set – AUC 0.91, Kappa 0.64, MCC 0.64
Class‑wise AUC across all seven sets (Test):
| Biomarker Set | Class 0 (0–1y) | Class 1 (1–3y) | Class 2 (3–5y) | Class 3 (>5y) |
|---|---|---|---|---|
| Primary | 0.99 | 0.83 | 0.93 | 0.84 |
| Secondary | 0.99 | 0.77 | 0.90 | 0.90 |
| Third | 0.98 | 0.80 | 0.93 | 0.73 |
| Fourth | 0.90 | 0.80 | 0.90 | 0.82 |
| Fifth | 0.99 | 0.85 | 0.91 | 0.90 |
| Sixth | 0.92 | 0.75 | 0.90 | 0.77 |
| Seventh | 0.95 | 0.82 | 0.93 | 0.85 |
Key observations:
- Class 0 (high‑risk patients): Exceptional prediction across all sets (AUC 0.90–0.99) – clinically valuable for identifying patients needing immediate intervention
- Class 2 (3–5 years): Consistently high AUC (0.90–0.93) – robust intermediate survival prediction
- Class 1 (1–3 years): Moderate performance (AUC 0.75–0.85) – challenging mid‑range group
- Class 3 (>5 years): Variable performance (AUC 0.73–0.90) – requires further refinement
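The per-class AUC ranges can be recomputed directly from the class-wise table. The short check below copies the test AUC values from that table and takes the minimum and maximum per class; the variable names are illustrative.

```python
# Test AUC per class (columns: Class 0, 1, 2, 3), copied from the table above.
CLASSWISE_AUC = {
    "Primary":   [0.99, 0.83, 0.93, 0.84],
    "Secondary": [0.99, 0.77, 0.90, 0.90],
    "Third":     [0.98, 0.80, 0.93, 0.73],
    "Fourth":    [0.90, 0.80, 0.90, 0.82],
    "Fifth":     [0.99, 0.85, 0.91, 0.90],
    "Sixth":     [0.92, 0.75, 0.90, 0.77],
    "Seventh":   [0.95, 0.82, 0.93, 0.85],
}

# Minimum and maximum AUC per class across the seven sets.
auc_ranges = {
    cls: (min(row[cls] for row in CLASSWISE_AUC.values()),
          max(row[cls] for row in CLASSWISE_AUC.values()))
    for cls in range(4)
}

for cls, (low, high) in auc_ranges.items():
    print(f"Class {cls}: {low:.2f}-{high:.2f}")
# Class 0: 0.90-0.99
# Class 1: 0.75-0.85
# Class 2: 0.90-0.93
# Class 3: 0.73-0.90
```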
Ensemble Models (Primary set – 20 genes):
| Ensemble Method | Base models | Meta‑model | Test AUC | Test MCC |
|---|---|---|---|---|
| Voting | RF + ET + LightGBM | Hard voting | 0.87 | 0.56 |
| Stacking | RF + ET | Logistic Regression | 0.88 | 0.61 |
| Stacking | RF + ET + XGB | CatBoost | 0.88 | 0.52 |
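Hard voting, as used in the first ensemble, takes the majority class label across the base models for each sample. The sketch below illustrates the idea with plain lists of predicted labels; the tie-breaking rule (smallest class label wins) is an assumption, and the predictions are invented for illustration.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority vote over per-model label lists.

    predictions: list of lists, one per base model, all of equal length.
    Assumption: ties are broken in favour of the smallest class label.
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    voted = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Hypothetical predictions from three base models (RF, ET, LightGBM).
rf_pred = [0, 1, 2, 3]
et_pred = [0, 1, 3, 3]
lgbm_pred = [1, 1, 2, 0]
print(hard_vote([rf_pred, et_pred, lgbm_pred]))  # [0, 1, 2, 3]
```

Stacking differs in that the base-model outputs are fed to a meta-model (Logistic Regression or CatBoost in the table) instead of being counted.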
External validation (GEO dataset GSE65904):
| Biomarker Set | Matched genes | Training AUC | Test AUC |
|---|---|---|---|
| Primary set | 15 genes | 0.85 | 0.83 |
| Third set | 12 genes | 0.85 | 0.86 |
Class‑wise AUC on GSE65904 (Third set):
| Class | AUC |
|---|---|
| Class 0 (0–1y) | 0.79 |
| Class 1 (1–3y) | 0.78 |
| Class 2 (3–5y) | 0.90 |
| Class 3 (>5y) | 0.97 |
Comparison with existing methods:
| Method | Focus | AUC |
|---|---|---|
| Our study – Fifth set | Multiple independent gene sets | 0.91 |
| Our study – Primary set | Multiple independent gene sets | 0.90 |
| Yang et al. (2023) | Invasion‑associated genes (IAGS) | 0.88 |
| Geng et al. (2023) | NLR‑related genes | ~0.85 |
| Ding et al. (2023) | Chemokine‑related (14 genes) | ~0.84 |
| Ping et al. (2022) | Ferroptosis‑related (10 genes) | ~0.82 |
| Xiao et al. (2021) | Immune‑related lncRNAs (8 genes) | ~0.80 |
On reported AUC, the best‑performing sets (Fifth and Primary) exceed the single‑signature methods listed above, and 4 of the 7 sets achieve AUC > 0.88.
Feature Selection Methods Evaluated:
| Method | Description |
|---|---|
| SVC‑L1 | Linear SVM with L1 penalty – primary method for biomarker identification |
| Recursive Feature Elimination (RFE) | Recursively ranks features, selects optimal subset |
| Sequential Feature Selection (SFS) | Sequentially adds/removes features to optimize performance |
| SelectKBest | Selects K most relevant features using f_classif |
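One simple way to obtain several non-overlapping 20-gene sets from a feature ranking is to slice the ranking into consecutive blocks. The sketch below shows only that idea; the source does not describe the exact procedure used to derive the seven sets, so the function, its inputs, and the slicing strategy are all assumptions. In practice the scores could be, for example, absolute coefficients from an L1-penalized linear SVM.

```python
def disjoint_top_k_sets(scores, k=20, n_sets=7):
    """Split a feature ranking into successive non-overlapping sets of size k.

    scores: dict mapping feature name -> importance (higher = more relevant).
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(scores, key=lambda gene: (-scores[gene], gene))
    if len(ranked) < k * n_sets:
        raise ValueError("not enough features for the requested sets")
    return [ranked[i * k:(i + 1) * k] for i in range(n_sets)]


# Hypothetical importances for 200 placeholder genes (g0 is the strongest).
toy_scores = {f"g{i}": 200 - i for i in range(200)}
gene_sets = disjoint_top_k_sets(toy_scores)

print(len(gene_sets), len(gene_sets[0]))     # 7 20
print(len(set().union(*gene_sets)))          # 140 distinct genes, so no overlap
```

An alternative with the same guarantee is to refit the selector after removing the genes already chosen, which lets later sets adapt to the reduced feature space.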
Machine Learning Algorithms Evaluated:
CatBoost (best), Extra Trees (ET), Random Forest (RF), XGBoost (XGB), LightGBM, SVM, KNN, MLP, AdaBoost, Gradient Boosting, Logistic Regression
Data Curation & Quality Control:
- Source: TCGA SKCM cohort via TCGAbiolinks (R) / Xena Browser
- Samples: 287 patients with complete survival data
- Normalization: Z‑score scaling (mean = 0, SD = 1)
- Class definition: Based on overall survival time (months)
- Class balancing: SMOTE (Synthetic Minority Over‑sampling Technique)
- Validation split: Stratified 80/20 (5‑fold CV) – each class represented equally
- External validation: GEO GSE65904 (independent melanoma cohort)
- Multiple testing correction: Benjamini‑Hochberg FDR (p < 0.05)
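Two of the steps above, Z-score scaling and Benjamini-Hochberg correction, are easy to illustrate in plain Python. This is a generic sketch of the standard procedures, not the authors' pipeline; using the population standard deviation for scaling is an assumption, and the input values are toy data.

```python
import statistics


def z_score(values):
    """Scale values to mean 0 and SD 1 (population SD; an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot scale a constant feature")
    return [(v - mean) / sd for v in values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


scaled = z_score([1.0, 2.0, 3.0])
print([round(v, 3) for v in scaled])     # [-1.225, 0.0, 1.225]

adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(p, 3) for p in adj])        # [0.02, 0.04, 0.04, 0.02]
```

Features whose adjusted p-value falls below 0.05 would be retained under the threshold listed above.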
Usage:
These datasets and models are designed for:
- Identifying multiple independent prognostic biomarker sets for SKCM
- Risk stratification of SKCM patients (high‑risk vs. long‑term survivors)
- Training and benchmarking machine learning classifiers (CatBoost, RF, ET, XGB) for survival prediction
- Drug target discovery (DGIdb analysis of prognostic genes)
- External validation of biomarker signatures on independent melanoma cohorts
- Understanding molecular determinants (lipid translocation, ion transport, mitochondrial function) of melanoma prognosis
Key Biological Insights:
- CREG1, PCGF5, VPS13C – high expression associated with longer survival (positive correlation)
- TTYH2, AC008687.4 – high expression associated with shorter survival (negative correlation)
- Lipid/phospholipid translocation and mitochondrial outer membrane pathways enriched in positively correlated genes
- Ion transport pathways (P‑type ATPases, ion channels) significantly enriched
Related Resources:
- TCGA SKCM data: https://xenabrowser.net (GDC TCGA Melanoma)
- GitHub: https://github.com/raghavagps/skcm_prognostic_biomarker
- DGIdb (drug‑gene interactions): https://www.dgidb.org
License: CC BY 4.0 (as per Frontiers open access license)
Contact:
Prof. Gajendra P. S. Raghava
Files
raghavagps/skcm_prognostic_biomarker-v1.0.zip
Total size: 279.9 MB
| Checksum | Size |
|---|---|
| md5:f39fa18fb5b71e46e61914d858f75a7c | 279.8 MB |
| md5:aad2a66c90d5ff979ed7a1bbec3e48a2 | 34.0 kB |
Additional details
Related works
- Is supplement to: Software: https://github.com/raghavagps/skcm_prognostic_biomarker/tree/v1.0 (URL)
Software
- Repository URL: https://github.com/raghavagps/skcm_prognostic_biomarker