Published February 10, 2025 | Version v2
Dataset Open

Proteome-wide Prediction of the Functional Impact of Missense Variants with ProteoCast

  • 1. Sorbonne Université
  • 2. Inserm
  • 3. Université Paris Cité

Description

This dataset contains mutation effect predictions for 22,169 Drosophila melanogaster protein isoforms, classifying over 293 million amino acid substitutions as neutral, uncertain, or impactful. The predictions were generated using the evolution-based GEMME model (E. Laine et al., MBE 2019) with multiple sequence alignments (MSAs) from the highly efficient ColabFold protocol (M. Mirdita et al., NatMet 2022; Abakarova et al., GBE 2023). Because the predictions are sensitive to the quality of the input MSA, we provide both global (per-protein) and local (per-residue) confidence metrics.

Predictions were validated using natural polymorphisms from the Drosophila Genetic Reference Panel (DGRP) and Drosophila Evolution over Space and Time (DEST2) datasets, as well as FlyBase’s developmentally lethal and hypomorphic mutations. Additionally, the dataset includes sensitivity data for post-translational modifications (PTMs) and short linear motifs (SLiMs), aiding functional site identification.

All of these data can be visualized at proteocast.ijm.fr.

Readme:

  1. Drosophila_ProteoCast.tar.gz - this archive contains ProteoCast predictions and analyses for each unique proteoform, with folder names corresponding to the IDs listed in the mapping_database.csv file. A detailed description of the folder structure and contents can be found in ReadMe.txt.
  2. data.tar.gz - this archive contains the data used in this study, sourced from FlyBase, DGRP2, and DEST2.
  3. Dmel6.44PredictionsRecap.csv - this summary file provides detailed information for each proteoform and covers all FlyBase protein IDs (FBpp_ID). It contains the following data:
    • Identifiers: FlyBase protein ID (FBpp_ID), protein symbol (Protein_symbol), gene ID (FBgn_ID), transcript ID (FBtr_ID), and UniProt ID if available (UniProt_ID).
    • Protein Characteristics: Sequence length (Length) and whether the proteoform is representative (Representative_FBpp).
    • MSA and GEMME Predictions: Fraction of observed mutations (F_obs) and number of sequences (Nb_seq_MSA) in the ColabFold MSA, presence or absence of GEMME predictions (GEMME_prediction), and global confidence score (GlobalConfidence).
    • Mutation Classification: Thresholds for defining mutations as neutral, uncertain, or impactful (GMM3_uncertain, GMM3_impactful).
    • Genomic Information: DNA strand (Strand) and exon coordinates (Exons_coordinates).
    • Structural Data: 3D structure file name if available (Structure_3D_file, Structure_3D).
    • Mutation Counts: Number of analyzed mutations and of affected residues, for mutations labelled as lethal or hypomorphic in FlyBase and for variants observed in the DEST2 and DGRP datasets (n_Lethal, n_Lethal_res, n_Hypomorphic, n_Hypomorphic_res, n_DEST2, n_DEST2_res, n_DGRP, n_DGRP_res, n_DEST_DGRP_union, n_DEST_DGRP_union_res).
  4. csv.tar.gz - this archive contains the files generated in this study. A detailed description of the folder structure and contents can be found in ReadMe.txt.
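
As a rough illustration of how the per-proteoform thresholds in Dmel6.44PredictionsRecap.csv might be used, the sketch below reads the FBpp_ID, GMM3_uncertain and GMM3_impactful columns and assigns a class to a single score. Several points are assumptions rather than documented behaviour: that both threshold columns hold plain numbers, that lower (more negative) scores indicate a stronger effect, and that the cutoffs are inclusive. The function names and the sample rows are invented for the example; ReadMe.txt and the publication remain the reference for the actual classification rule.

```python
import csv
import io


def load_thresholds(handle):
    """Map each FBpp_ID to its (GMM3_uncertain, GMM3_impactful) pair.

    Rows whose thresholds are empty or non-numeric are skipped.
    """
    thresholds = {}
    for row in csv.DictReader(handle):
        try:
            pair = (float(row["GMM3_uncertain"]), float(row["GMM3_impactful"]))
        except (ValueError, TypeError):
            continue  # no usable thresholds for this proteoform
        thresholds[row["FBpp_ID"]] = pair
    return thresholds


def classify(score, uncertain_cutoff, impactful_cutoff):
    """Assign a class to one predicted score.

    ASSUMPTION: lower scores mean a stronger effect, so
    impactful_cutoff <= uncertain_cutoff, and cutoffs are inclusive.
    """
    if score <= impactful_cutoff:
        return "impactful"
    if score <= uncertain_cutoff:
        return "uncertain"
    return "neutral"


# Invented rows that only mimic the column names described above.
SAMPLE = (
    "FBpp_ID,GMM3_uncertain,GMM3_impactful\n"
    "FBpp0000001,-1.5,-3.0\n"
    "FBpp0000002,,\n"
)

demo = load_thresholds(io.StringIO(SAMPLE))
uncertain_cutoff, impactful_cutoff = demo["FBpp0000001"]
demo_label = classify(-2.0, uncertain_cutoff, impactful_cutoff)
```

With the real file, the same function could be called on an open file handle, for example `with open("Dmel6.44PredictionsRecap.csv", newline="") as fh: thresholds = load_thresholds(fh)`.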

Files

Files (56.8 GB)

Checksum Size
md5:3088a4f15118e6e193a981ca4421cf8d 174.0 MB
md5:fd494667999a0f5a25a54677f7d5c4bc 570.4 MB
md5:dc856352818ef1e00d0571b29600e4b0 9.5 MB
md5:81e10184ab3d69d65e37e57dc5cd992d 56.0 GB
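
Given the size of the archives, it may be worth checking a download against the MD5 values listed above before unpacking it. The following is a minimal sketch using only the Python standard library. The function names are made up for this example, and the file is read in chunks so that the largest archive does not need to fit in memory.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read chunk by chunk."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_listing(stream, listed):
    """Compare a stream's digest with a listed value such as 'md5:3088...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_stream(stream) == expected


# Self-check on a well-known digest: MD5 of b"hello".
self_check = matches_listing(
    io.BytesIO(b"hello"), "md5:5d41402abc4b2a76b9719d911017c592"
)
```

For a downloaded file, open it in binary mode and pass the handle together with the checksum shown for that file on the record page, for example `with open(path, "rb") as fh: ok = matches_listing(fh, "md5:...")`, where `path` is wherever the archive was saved locally.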

Additional details

Related works

Is described by
Publication: 10.1101/2025.02.09.637326 (DOI)

Funding

Agence Nationale de la Recherche
ADAGIO - Ageing and Natural DeAth GenetIc cOntrollers ANR-20-CE44-0010
European Research Council
PROMISE 101087830

Software

Repository URL
https://proteocast.ijm.fr/drosophiladb/
Programming language
Python, HTML, CSS
Development Status
Active