Proteome-wide Prediction of the Functional Impact of Missense Variants with ProteoCast
Authors/Creators
Description
This dataset contains mutation effect predictions for 22,169 Drosophila melanogaster protein isoforms, classifying over 293 million amino acid substitutions as neutral, uncertain, or impactful. The predictions were generated using the evolution-based GEMME model (E. Laine et al. MBE 2019), with multiple sequence alignments (MSAs) produced by the highly efficient ColabFold protocol (M. Mirdita et al. NatMet 2022, Abakarova et al. GBE 2023). Because the predictions are sensitive to the quality of the input MSA, we provide both global (per-protein) and local (per-residue) confidence metrics.
Predictions were validated using natural polymorphisms from the Drosophila Genetic Reference Panel (DGRP) and Drosophila Evolution over Space and Time (DEST2) datasets, as well as FlyBase’s developmentally lethal and hypomorphic mutations. Additionally, the dataset includes sensitivity data for post-translational modifications (PTMs) and short linear motifs (SLiMs), aiding functional site identification.
All of these data can be visualized at proteocast.ijm.fr.
Readme:
- Drosophila_ProteoCast.tar.gz - this archive contains ProteoCast predictions and analysis for each unique proteoform, with folder names corresponding to IDs listed in the mapping_database.csv file. A detailed description of the folder structure and contents can be found in ReadMe.txt.
- data.tar.gz - this archive contains the data used in this study, sourced from FlyBase, DGRP2, and DEST2.
- csv.tar.gz - this archive contains the files generated in this study. A detailed description of the folder structure and contents can be found in ReadMe.txt.
- ClinVar_data_analysis.tar.gz - this archive contains per-protein ProteoCast predictions for the ClinVar dataset of missense variants taken from ProteinGym.
- CAID3_data_analysis.tar.gz - this archive contains the results of the analyses performed on the CAID3 benchmark datasets, including variant effect predictions and performance metrics used to evaluate ProteoCast on independent binding-related datasets.
- PhyloHMM_data_analysis.tar.gz - this archive contains the PhyloHMM yeast binding-site dataset, along with the corresponding ProteoCast predictions and performance evaluations.
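Since the archives listed above are large, it can be convenient to inspect their folder names without unpacking them. The following is a minimal sketch using Python's standard `tarfile` module in streaming mode. The function name `entries_at_depth` and its `depth` and `limit` parameters are hypothetical helpers introduced here for illustration, and the internal layout of each archive is an assumption; ReadMe.txt remains the authoritative description.

```python
import tarfile


def entries_at_depth(archive_path, depth=1, limit=None):
    """List distinct entry names found at a given folder depth of a .tar.gz.

    The archive is read as a forward-only stream ("r|gz"), so nothing is
    extracted to disk. `depth=1` returns top-level names; `depth=2` returns
    names one level below, which helps if everything sits under one root
    folder (an assumption about the layout, not a documented fact).
    """
    names = set()
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            # Drop empty and "." components such as a leading "./".
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if len(parts) >= depth:
                names.add(parts[depth - 1])
            # Stop early once enough distinct names have been collected.
            if limit is not None and len(names) >= limit:
                break
    return sorted(names)
```

Calling `entries_at_depth("Drosophila_ProteoCast.tar.gz", depth=1, limit=20)` would return up to 20 distinct top-level names, which could then be compared against the IDs in mapping_database.csv.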
Files
(55.1 GB)
| MD5 checksum | Size |
|---|---|
| 49be5865cabb9da4efab352e12dfeec3 | 401.7 MB |
| 181055b8d687b9ec8e2962dbec41c730 | 417.1 MB |
| ae14cc928e6f01eb33a8c650b4077a81 | 213.7 MB |
| fb1a19bbe93da855c41dd6bd2b2eb6cd | 570.4 MB |
| b58f1dca87fac3905ed8228ed4178ae2 | 52.2 GB |
| 83c404f609f8159047c07dc042c76a3f | 1.3 GB |
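Each file in this record is listed with an md5 checksum, which can be used to confirm that a download completed without corruption. The sketch below shows one way to do this with Python's standard `hashlib` module. The function names `md5_of_file` and `matches_checksum` are hypothetical helpers, not part of ProteoCast, and the file is read in chunks so that even the 52.2 GB archive does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with a listed checksum.

    Accepts the checksum with or without the "md5:" prefix used on the
    record page, and ignores case and surrounding whitespace.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

To use it, pass the path of a downloaded archive together with the checksum shown for that file on the record page; a return value of `False` suggests the file should be downloaded again.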
Additional details
Related works
- Is described by
- Publication: 10.1101/2025.02.09.637326 (DOI)
Funding
- Agence Nationale de la Recherche
- ADAGIO - Ageing and Natural DeAth GenetIc cOntrollers ANR-20-CE44-0010
- European Research Council
- PROMISE 101087830
Software
- Repository URL
- https://proteocast.ijm.fr/drosophiladb/
- Programming language
- Python, HTML, CSS
- Development Status
- Active