Proteome-wide Prediction of the Functional Impact of Missense Variants with ProteoCast
Creators
Description
This dataset contains mutation effect predictions for 22,169 Drosophila melanogaster protein isoforms, classifying over 293 million amino acid substitutions as neutral, uncertain, or impactful. The predictions were generated using the evolution-based GEMME model (E. Laine et al., MBE 2019) with multiple sequence alignments (MSAs) from the highly efficient ColabFold protocol (M. Mirdita et al., NatMet 2022; Abakarova et al., GBE 2023). Because the predictions are sensitive to the quality of the input MSA, we provide global (per-protein) and local (per-residue) confidence metrics.
Predictions were validated using natural polymorphisms from the Drosophila Genetic Reference Panel (DGRP) and Drosophila Evolution over Space and Time (DEST2) datasets, as well as FlyBase’s developmentally lethal and hypomorphic mutations. Additionally, the dataset includes sensitivity data for post-translational modifications (PTMs) and short linear motifs (SLiMs), aiding functional site identification.
All of these data can be visualized at proteocast.ijm.fr.
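The three-way classification described above can be illustrated with a minimal sketch. It assumes that lower (more negative) scores indicate a stronger predicted impact, that each protein has two cutoffs playing the role of the GMM3_uncertain and GMM3_impactful thresholds, and that scores exactly at a cutoff fall into the more severe class. The function name, the boundary convention, and the numeric values are hypothetical and are not taken from the ProteoCast code.

```python
def classify_score(score, uncertain_threshold, impactful_threshold):
    """Label one substitution score as neutral, uncertain, or impactful.

    Assumption (not confirmed by the dataset description): lower scores
    mean a stronger predicted impact, so the impactful cutoff must not
    exceed the uncertain cutoff.
    """
    if impactful_threshold > uncertain_threshold:
        raise ValueError("impactful_threshold must be <= uncertain_threshold")
    if score <= impactful_threshold:
        return "impactful"
    if score <= uncertain_threshold:
        return "uncertain"
    return "neutral"


# Illustrative cutoffs only; real values are protein-specific.
labels = [classify_score(s, -2.5, -4.0) for s in (-5.0, -3.0, -1.0)]
print(labels)
```

Keeping the cutoffs as arguments reflects the fact that the thresholds are provided per protein rather than as a single global value.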
Readme:
- Drosophila_ProteoCast.tar.gz - this archive contains ProteoCast predictions and analysis for each unique proteoform, with folder names corresponding to IDs listed in the mapping_database.csv file. A detailed description of the folder structure and contents can be found in ReadMe.txt.
- data.tar.gz - this archive contains the data used in this study, sourced from FlyBase, DGRP2, and DEST2.
- Dmel6.44PredictionsRecap.csv - this summary file provides detailed information for each proteoform, with all FlyBase protein IDs (FBpp_ID) included. It contains the following data:
  - Identifiers: FlyBase protein ID (FBpp_ID), protein symbol (Protein_symbol), gene ID (FBgn_ID), transcript ID (FBtr_ID), and UniProt ID if available (UniProt_ID).
  - Protein Characteristics: Sequence length (Length) and whether the proteoform is representative (Representative_FBpp).
  - MSA and GEMME Predictions: Fraction of observed mutations (F_obs) and number of sequences (Nb_seq_MSA) in the ColabFold MSA, presence or absence of GEMME predictions (GEMME_prediction), and global confidence score (GlobalConfidence).
  - Mutation Classification: Thresholds for defining mutations as neutral, uncertain, or impactful (GMM3_uncertain, GMM3_impactful).
  - Genomic Information: DNA strand (Strand) and exon coordinates (Exons_coordinates).
  - Structural Data: 3D structure file name if available (Structure_3D_file, Structure_3D).
  - Mutation Counts: Number of analyzed mutations and affected residues that are labelled as lethal or hypomorphic in FlyBase, or that come from the DEST2 and DGRP datasets (n_Lethal, n_Lethal_res, n_Hypomorphic, n_Hypomorphic_res, n_DEST2, n_DEST2_res, n_DGRP, n_DGRP_res, n_DEST_DGRP_union, n_DEST_DGRP_union_res).
- csv.tar.gz - this archive contains the files generated in this study. A detailed description of the folder structure and contents can be found in ReadMe.txt.
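As a starting point for working with the summary file, the sketch below reads a recap-style table with Python's standard csv module and keeps the rows whose MSA contains at least a given number of sequences. It assumes the file is comma-delimited with a header row and that Nb_seq_MSA holds numeric values; the helper names, the depth cutoff, and the sample rows are hypothetical and do not correspond to real entries.

```python
import csv
import io


def read_recap(handle):
    """Read a recap-style table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def filter_by_msa_depth(rows, min_sequences):
    """Keep rows whose Nb_seq_MSA value is at least min_sequences."""
    kept = []
    for row in rows:
        try:
            depth = int(float(row["Nb_seq_MSA"]))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value
        if depth >= min_sequences:
            kept.append(row)
    return kept


# Made-up rows for illustration only; these are not real dataset entries.
sample = io.StringIO(
    "FBpp_ID,Length,Nb_seq_MSA\n"
    "FBpp_example_1,120,5000\n"
    "FBpp_example_2,300,40\n"
    "FBpp_example_3,85,\n"
)
rows = read_recap(sample)
deep = filter_by_msa_depth(rows, min_sequences=200)
print([row["FBpp_ID"] for row in deep])
```

To apply the same helpers to the downloaded file, open it with open("Dmel6.44PredictionsRecap.csv", newline="") and pass the resulting handle to read_recap.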
Files
Dmel6.44PredictionsRecap.csv
Additional details
Related works
- Is described by
- Publication: 10.1101/2025.02.09.637326 (DOI)
Funding
- Agence Nationale de la Recherche
- ADAGIO - Ageing and Natural DeAth GenetIc cOntrollers ANR-20-CE44-0010
- European Research Council
- PROMISE 101087830
Software
- Repository URL
- https://proteocast.ijm.fr/drosophiladb/
- Programming language
- Python, HTML, CSS
- Development Status
- Active