Published 2026 | Version v3
Dataset Open

Proteome-wide Prediction of the Functional Impact of Missense Variants with ProteoCast

  • 1. Sorbonne Université
  • 2. Inserm
  • 3. Université Paris Cité

Description

This dataset contains mutation effect predictions for 22,169 Drosophila melanogaster protein isoforms, classifying over 293 million amino acid substitutions as neutral, uncertain, or impactful. The predictions were generated with the evolution-based GEMME model (E. Laine et al., MBE 2019), using multiple sequence alignments (MSAs) built with the highly efficient ColabFold protocol (M. Mirdita et al., NatMet 2022; Abakarova et al., GBE 2023). Because the predictions are sensitive to the quality of the input MSA, we provide both global (per-protein) and local (per-residue) confidence metrics.
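
As a rough sanity check on the scale quoted above, the following sketch assumes that each residue contributes 19 single amino acid substitutions (every standard amino acid except the wild type). That counting convention and the function names are illustrative assumptions, not taken from the ProteoCast code.

```python
# Assumption: the 20 standard amino acids, one-letter codes.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALTERNATIVES_PER_RESIDUE = len(STANDARD_AMINO_ACIDS) - 1  # 19


def count_single_substitutions(sequence):
    """Count all single amino acid substitutions of a protein sequence."""
    return len(sequence) * ALTERNATIVES_PER_RESIDUE


def implied_mean_length(total_substitutions, n_isoforms):
    """Mean isoform length implied by a substitution total (19 per residue)."""
    return total_substitutions / ALTERNATIVES_PER_RESIDUE / n_isoforms


if __name__ == "__main__":
    # Figures quoted in the description: 293 million substitutions, 22,169 isoforms.
    print(round(implied_mean_length(293_000_000, 22_169)))
```

Under that assumption, 293 million substitutions across 22,169 isoforms would correspond to a mean length of roughly 700 residues per isoform.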

Predictions were validated using natural polymorphisms from the Drosophila Genetic Reference Panel (DGRP) and Drosophila Evolution over Space and Time (DEST2) datasets, as well as FlyBase’s developmentally lethal and hypomorphic mutations. Additionally, the dataset includes sensitivity data for post-translational modifications (PTMs) and short linear motifs (SLiMs), aiding functional site identification.

All of these data can be visualized at proteocast.ijm.fr.

Readme:

  1. Drosophila_ProteoCast.tar.gz - this archive contains ProteoCast predictions and analyses for each unique proteoform, with folder names corresponding to the IDs listed in the mapping_database.csv file. A detailed description of the folder structure and contents can be found in ReadMe.txt.
  2. data.tar.gz - this archive contains the data used in this study, sourced from FlyBase, DGRP2, and DEST2.
  3. csv.tar.gz - this archive contains the files generated in this study. A detailed description of the folder structure and contents can be found in ReadMe.txt.
  4. ClinVar_data_analysis.tar.gz - this archive contains per-protein ProteoCast predictions for the ClinVar dataset of missense variants taken from ProteinGym.
  5. CAID3_data_analysis.tar.gz - this archive contains the results of the analyses performed on the CAID3 benchmark datasets, including variant effect predictions and performance metrics used to evaluate ProteoCast on independent binding-related datasets.
  6. PhyloHMM_data_analysis.tar.gz - this archive contains the PhyloHMM yeast binding-site dataset, along with the corresponding ProteoCast predictions and performance evaluations.
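
Since the archives are large, it can help to look at the first few entries of one before extracting it. The sketch below uses Python's standard tarfile module. The function name and the limit parameter are illustrative, and no particular internal layout of the archives is assumed.

```python
import tarfile
from itertools import islice


def preview_archive(path, limit=20):
    """Return the names of the first `limit` members of a tar archive.

    Members are read sequentially, so the archive is only decompressed
    as far as needed to list `limit` entries.
    """
    with tarfile.open(path, "r:*") as archive:  # "r:*" auto-detects compression
        return [member.name for member in islice(archive, limit)]


if __name__ == "__main__":
    # Hypothetical local path to one of the downloaded archives.
    for name in preview_archive("csv.tar.gz", limit=10):
        print(name)
```

The folder names seen in Drosophila_ProteoCast.tar.gz can then be matched against the IDs in mapping_database.csv, as described in ReadMe.txt.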

Files

Files (55.1 GB)

Checksum                              Size
md5:49be5865cabb9da4efab352e12dfeec3  401.7 MB
md5:181055b8d687b9ec8e2962dbec41c730  417.1 MB
md5:ae14cc928e6f01eb33a8c650b4077a81  213.7 MB
md5:fb1a19bbe93da855c41dd6bd2b2eb6cd  570.4 MB
md5:b58f1dca87fac3905ed8228ed4178ae2  52.2 GB
md5:83c404f609f8159047c07dc042c76a3f  1.3 GB
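
A downloaded archive can be checked against its listed MD5 checksum before use. The following is a minimal sketch with Python's standard hashlib module. The function names are illustrative, and the pairing of each checksum with an archive name should be taken from the record page, since it is not stated in the listing here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file with a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use constant, which matters for the 52.2 GB archive.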

Additional details

Related works

Is described by
Publication: 10.1101/2025.02.09.637326 (DOI)

Funding

Agence Nationale de la Recherche
ADAGIO - Ageing and Natural DeAth GenetIc cOntrollers ANR-20-CE44-0010
European Research Council
PROMISE 101087830

Software

Repository URL
https://proteocast.ijm.fr/drosophiladb/
Programming language
Python, HTML, CSS
Development Status
Active