Published April 29, 2024 | Version v1
Dataset
Open
Data for 'VespaG: Expert-guided protein language models enable accurate and blazingly fast fitness prediction'
Description
Datasets used for development of VespaG and VespaG predictions generated with https://github.com/JSchlensok/VespaG.
Uploads contain:
- Performance summaries for ProteinGym [1]:
  - Spearman and Pearson correlation for VespaG: proteingym_performance_vespag.csv (columns: 'DMS_id', 'Spearman', 'Pearson')
  - Spearman correlation for the evaluated methods VespaG, GEMME [2], VESPA [3], TranceptEVE [4], AlphaMissense [5], and PoET [6]: proteingym_spearman_allmethods.csv (columns: 'DMS_id', 'Trancept EVE-L', 'VESPA', 'VespaG', 'GEMME', 'AlphaMissense', 'PoET', 'UniProt_ID', 'coarse_selection_type' (function), 'taxon')
- FASTA files with sequences for all training sets (vespag_fasta_training_datasets.zip with seq_all9k.fasta, seq_human5k.fasta, seq_droso4k.fasta, seq_ecoli2k.fasta, seq_virus1k.fasta) and the test set (proteingym_217.fasta)
- VespaG predictions for the test set: vespag_proteingym_rawpreds_by_training_dataset.zip with raw_preds_ecoli.csv, raw_preds_human.csv, raw_preds_virus.csv, raw_preds_all.csv, raw_preds_droso.csv (columns: 'DMS_id', 'mutation', 'DMS_score', 'VespaG'). Each file holds predictions from a model trained on a different dataset. The final VespaG model was trained on a subset of the human proteome, so its raw predictions for the ProteinGym benchmark are in raw_preds_human.csv (used to calculate the performances above).
- GEMME predictions for the training sets: gemme_rawpreds_by_training_dataset.zip with folders 'human', 'droso', 'ecoli', 'virus', 'all', one per FASTA file (each containing GEMME mutational landscape output files named 'ID' + '_normPred_evolCombi.txt')
- ESM-2 embeddings [7] for test set (proteingym_217_esm2.h5)
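As a minimal sketch of how the per-assay Spearman correlations could be recomputed from a raw prediction file such as raw_preds_human.csv, the snippet below groups rows by 'DMS_id' and correlates 'DMS_score' with 'VespaG'. It assumes a comma-separated file with a header row and no missing values, which is not confirmed here; the helper names are hypothetical, and the demo table is made up rather than taken from the dataset. The repository linked above remains the reference for how the published numbers were produced.

```python
import csv
import io
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_per_assay(handle):
    """Group rows by 'DMS_id' and correlate 'DMS_score' with 'VespaG'."""
    scores = defaultdict(lambda: ([], []))
    for row in csv.DictReader(handle):
        measured, predicted = scores[row["DMS_id"]]
        measured.append(float(row["DMS_score"]))
        predicted.append(float(row["VespaG"]))
    return {dms: spearman(m, p) for dms, (m, p) in scores.items()}


# Made-up rows standing in for raw_preds_human.csv (assumed layout)
demo = io.StringIO(
    "DMS_id,mutation,DMS_score,VespaG\n"
    "assayA,A1C,0.1,0.2\n"
    "assayA,A1D,0.5,0.4\n"
    "assayA,A1E,0.9,0.7\n"
    "assayB,M1A,1.0,0.1\n"
    "assayB,M1C,2.0,0.0\n"
)
result = spearman_per_assay(demo)
print(result)
```

For a real file, pass an open file handle instead of the in-memory demo, for example `open("raw_preds_human.csv", newline="")`.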
For details on VespaG see:
VespaG: Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction
Celine Marquet, Julius Schlensok, Marina Abakarova, Burkhard Rost, Elodie Laine
bioRxiv 2024.04.24.590982; doi: https://doi.org/10.1101/2024.04.24.590982
For more information on data usage and generation please see https://github.com/JSchlensok/VespaG.
Abstract:
Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast single amino acid variant effect predictor, leveraging embeddings of protein Language Models as input to a minimal deep learning model. To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from the human proteome applying the multiple sequence alignment-based effect predictor GEMME as a pseudo standard-of-truth. Assessed against the ProteinGym Substitution Benchmark (217 multiplex assays of variant effect with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48 +/- 0.01, matching state-of-the-art methods such as GEMME, TranceptEVE, PoET, AlphaMissense, and VESPA. VespaG reached its top-level performance several orders of magnitude faster, predicting all mutational landscapes of the human proteome in 30 minutes on a consumer laptop (12-core CPU, 16 GB RAM).
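A summary of the form "mean +/- uncertainty" can be derived from the per-assay values in proteingym_performance_vespag.csv. The sketch below averages the 'Spearman' column and reports the standard error of the mean. Treating "+/-" as a standard error is an assumption made here for illustration; the preprint should be consulted for the definition actually used. The function name and demo values are hypothetical.

```python
import csv
import io
import statistics


def mean_and_standard_error(handle, column="Spearman"):
    """Return (mean, standard error of the mean) for one numeric CSV column."""
    values = [float(row[column]) for row in csv.DictReader(handle)]
    mean = statistics.fmean(values)
    # Sample standard deviation divided by sqrt(n); an assumed reading of "+/-"
    sem = statistics.stdev(values) / len(values) ** 0.5
    return mean, sem


# Made-up rows standing in for proteingym_performance_vespag.csv
demo = io.StringIO(
    "DMS_id,Spearman,Pearson\n"
    "assayA,0.40,0.42\n"
    "assayB,0.50,0.48\n"
    "assayC,0.60,0.61\n"
)
mean, sem = mean_and_standard_error(demo)
print(f"{mean:.2f} +/- {sem:.2f}")
```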
[1] Notin, Pascal, et al. "ProteinGym: large-scale benchmarks for protein fitness prediction and design." Advances in Neural Information Processing Systems 36 (2024).
[2] Laine, Elodie, Yasaman Karami, and Alessandra Carbone. "GEMME: a simple and fast global epistatic model predicting mutational effects." Molecular biology and evolution 36.11 (2019): 2604-2619.
[3] Marquet, Céline, et al. "Embeddings from protein language models predict conservation and variant effects." Human genetics 141.10 (2022): 1629-1647.
[4] Notin, Pascal, et al. "TranceptEVE: Combining family-specific and family-agnostic models of protein sequences for improved fitness prediction." bioRxiv (2022): 2022-12.
[5] Cheng, Jun, et al. "Accurate proteome-wide missense variant effect prediction with AlphaMissense." Science 381.6664 (2023): eadg7492.
[6] Truong Jr, Timothy, and Tristan Bepler. "PoET: A generative model of protein families as sequences-of-sequences." Advances in Neural Information Processing Systems 36 (2024).
[7] Lin, Zeming, et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379.6637 (2023): 1123-1130.
Files
gemme_rawpreds_by_training_dataset.zip
Total size: 2.2 GB
MD5 | Size |
---|---|
2bcdcf7183e1203779e72ac61196ac83 | 1.2 GB |
d6eb5c0393041329db0aae192f593440 | 92.7 kB |
0aa1df362afedbe4d76f7e4edc9a5196 | 882.6 MB |
7ff52a6f863a07c6e45a141b4f7ba82b | 14.4 kB |
12befd1b740a934ec7c4b60aefcaf283 | 26.7 kB |
c20de9f0b1f843406427a46303d898ce | 4.9 MB |
b14e2decf4186b6597712d1f28b8c80e | 163.7 MB |
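The md5 values listed under Files can be used to check that a download arrived intact. The following is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, and the demo hashes a small temporary file rather than one of the archives.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Demo on a temporary file; replace with a downloaded archive and its listed md5
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_checksum = "md5:" + hashlib.md5(b"hello").hexdigest()
is_intact = matches_checksum(demo_path, demo_checksum)
is_corrupt_detected = not matches_checksum(demo_path, "md5:" + "0" * 32)
os.remove(demo_path)
print(is_intact, is_corrupt_detected)
```

Reading in chunks keeps memory use flat, which matters for the larger archives in this record.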
Additional details
Related works
- Is supplement to
- Preprint: 10.1101/2024.04.24.590982 (DOI)
Software
- Repository URL
- https://github.com/JSchlensok/VespaG
- Programming language
- Python