Published September 12, 2024 | Version v3
Journal article Open

D2Deep: Combining evolution and protein language models for cancer driver mutation prediction

  • 1. Interuniversity Institute of Bioinformatics, Vrije Universiteit Brussel (VUB) - Université Libre de Bruxelles (ULB)
  • 2. Interuniversity Institute of Bioinformatics, Vrije Universiteit Brussel (VUB) - Université Libre de Bruxelles (ULB), Brussels Interuniversity Genomics High Throughput Core (BRIGHTcore), VUB-ULB
  • 3. Interuniversity Institute of Bioinformatics, Vrije Universiteit Brussel (VUB) - Université Libre de Bruxelles (ULB), Structural Biology Brussels, VUB

Description

Datasets containing predictions, training data, and validation data for the D2Deep predictor:

  • D2Deep_predictions: D2Deep predictions for mutations in cancer driver proteins included in the Next Generation Sequencing (NGS) panel for biopsies of haematological and solid tumours from the Compermed Guidelines (https://www.compermed.be/en/guidelines)
  • Features: Epistatic features that integrate evolutionary and co-evolutionary information and can be used to identify short- and long-range effects of mutations within proteins.
  • common_variants: Common variants from the gnomAD database (December 2022)
  • dbSNP: Single nucleotide polymorphisms (SNPs) from the Single Nucleotide Polymorphism database (dbSNP)
  • humsavar_benign_mutations: UniProtKB/Swiss-Prot human missense variants (release of 21 December 2021)
  • clinvar_benign_deleterious_missense: ClinVar missense variants (March 2023)
  • Tier.csv: Missense Tier 1, 2, and 3 mutations from the Catalogue of Somatic Mutations in Cancer (COSMIC - Cancer Mutation Census, release v92)
  • cgi.csv: Missense oncogenic mutations from Cancer Genome Interpreter (release 2018)
  • Balanced_training_set: Pathogenic/benign balanced set (on gene level) used for training the model
  • log_probWT_MUT_Tier1_2_3_common_balanced+-2_2200AA_57maxpool: Training set features used for model training
  • DMS_mutations: Deep Mutational Scanning mutations used for validation (2021 - https://doi.org/10.15252/msb.202110305)
  • DRGN_testset: DRGN test set used for validation
  • clinvar_balanced_somatic_germline_missense: ClinVar somatic versus germline subset used for validation (March 2023)
  • 5genes_clinvarlabels_D2D_confidence_all: Performance of 6 predictors on mutations in 5 cancer genes (March 2023)
  • TP53_expert_multiple_single_submitters, BRAF_expert_multiple_single_submitters, CHEK2_expert_multiple_single_submitters, AR_expert_multiple_single_submitters, PTEN_expert_multiple_single_submitters: ClinVar labels with Review status: Practice guideline, Expert panel, Multiple submitters, Single submitter (March 2023)
  • all_msas: MMseqs2 multiple sequence alignments for the proteins used
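
The gene-level balancing mentioned for Balanced_training_set can be pictured with a minimal sketch. This is not the authors' procedure: it assumes simple random downsampling of the larger class within each gene, and the record layout (gene, mutation, label), the label strings, and the function name are hypothetical choices made for illustration only.

```python
import random
from collections import defaultdict


def balance_per_gene(records, seed=0):
    """Downsample so each gene keeps equally many pathogenic and benign records.

    `records` is an iterable of (gene, mutation, label) tuples, where label is
    "pathogenic" or "benign" (an assumed layout, not the files' real schema).
    Genes that lack one of the two classes are dropped entirely.
    """
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    by_gene = defaultdict(lambda: {"pathogenic": [], "benign": []})
    for gene, mutation, label in records:
        by_gene[gene][label].append((gene, mutation, label))

    balanced = []
    for gene in sorted(by_gene):
        groups = by_gene[gene]
        # The smaller class size sets how many records each class keeps.
        keep = min(len(groups["pathogenic"]), len(groups["benign"]))
        for label in ("pathogenic", "benign"):
            balanced.extend(rng.sample(groups[label], keep))
    return balanced


if __name__ == "__main__":
    toy = [
        ("TP53", "R175H", "pathogenic"),
        ("TP53", "R248Q", "pathogenic"),
        ("TP53", "P72R", "benign"),
        ("BRAF", "G464V", "benign"),
    ]
    for row in balance_per_gene(toy):
        print(row)
```

In this toy input, TP53 keeps one record per class and BRAF is dropped because it has no pathogenic record.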

 

You can use our web server to query protein mutations and use the interactive visualizations: https://tumorscope.be/d2deep/

Files

features.zip (5.1 GB)
md5:0782ebea212d6650226c380830754f72
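
Because the archive is large, it may be worth checking the download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using only the Python standard library. The local path `features.zip` and the helper names are assumptions; the expected digest is the one given in this record.

```python
import hashlib
from pathlib import Path

# Checksum as listed in this record for features.zip.
EXPECTED_MD5 = "0782ebea212d6650226c380830754f72"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("features.zip")  # assumed local download location
    if not archive.exists():
        print(f"{archive} not found; download it first")
    elif verify_download(archive):
        print("checksum OK")
    else:
        print("checksum MISMATCH; the download may be incomplete or corrupted")
```

Reading in 1 MiB chunks keeps memory use flat regardless of the archive size.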

Additional details

Dates

Available
2023-11-17
https://www.biorxiv.org/content/10.1101/2023.11.17.567550v1